The PMLB Python package provides the following user-facing functions and variables. For installation instructions or a high-level overview of how to use PMLB, see the Install and Using PMLB pages.

Functions

fetch_data

function pmlb.fetch_data(dataset_name, return_X_y=False, local_cache_dir=None, dropna=True)

Download a dataset from PMLB, (optionally) store it locally, and return the dataset.

You must be connected to the internet if you are fetching a dataset that is not cached locally.

Parameters:

  • dataset_name : str Name of the dataset to fetch. A list of available datasets can be found in the documentation or via pmlb.get_updated_datasets.
  • return_X_y : bool, default=False Specify the format of the data returned to the user. If False, the data are returned as a pandas.DataFrame of shape (n_samples, n_features+1), where the last column, target, holds the true target value of each data point. If True, the data are returned as a 2-tuple: the first element is a numpy.ndarray of shape (n_samples, n_features) containing the feature matrix (usually denoted X), and the second element is a numpy.ndarray of shape (n_samples,) containing the target values (usually denoted y).
  • local_cache_dir : str, default=None The directory on your local machine in which to store the data files. If None, no local data cache will be used.
  • dropna : bool, default=True If True, pmlb will drop missing values (NAs) from the returned dataset.

Returns:

  • dataset : pandas.DataFrame or (array-like, array-like) If return_X_y == False, a pandas.DataFrame containing the fetched dataset. If return_X_y == True, a 2-tuple of numpy.ndarrays containing the feature matrix X and the target array y, respectively.

Example:

from pmlb import fetch_data

X, y = fetch_data('mushroom', return_X_y=True)
X
## array([[2, 0, 7, ..., 1, 4, 6],
##        [0, 3, 9, ..., 0, 2, 0],
##        [2, 3, 8, ..., 0, 3, 0],
##        ...,
##        [2, 0, 8, ..., 7, 2, 0],
##        [2, 3, 8, ..., 7, 3, 0],
##        [3, 2, 0, ..., 7, 4, 6]])
y
## array([0, 0, 0, ..., 0, 0, 1])
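
A common pattern is to fetch several datasets in a loop while reusing a local cache, so that repeated runs do not need to download the files again. The following is a minimal sketch of that pattern, not part of PMLB itself: the helper name fetch_all, the default directory name pmlb_cache, and the fetch parameter (which allows a stand-in fetcher to be swapped in) are all hypothetical. It relies only on the fetch_data signature documented above. Creating the directory up front is a precaution, not a documented requirement.

```python
from pathlib import Path


def fetch_all(names, cache_dir="pmlb_cache", fetch=None):
    """Fetch each named dataset as an (X, y) tuple, caching files under cache_dir.

    fetch_all, cache_dir's default, and the fetch parameter are hypothetical
    names introduced for this sketch; they are not part of the PMLB API.
    """
    if fetch is None:
        # Imported lazily so the helper can also be used with a stand-in fetcher.
        from pmlb import fetch_data as fetch

    # Make sure the cache directory exists before fetching (a precaution).
    Path(cache_dir).mkdir(parents=True, exist_ok=True)

    results = {}
    for name in names:
        # return_X_y=True yields the (feature matrix, target array) tuple.
        X, y = fetch(name, return_X_y=True, local_cache_dir=str(cache_dir))
        results[name] = (X, y)
    return results
```

With PMLB installed and an internet connection available for the first run, fetch_all(['mushroom']) would be expected to return a dictionary whose 'mushroom' entry holds the same (X, y) pair as in the example above.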

Variables

These variables are used to list the datasets that are currently available in PMLB.

dataset_names

variable pmlb.dataset_names

A list of all datasets included in PMLB.

The contents of this variable are equal to the union of pmlb.classification_dataset_names and pmlb.regression_dataset_names.
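
The union relationship can be illustrated with a short sketch. The lists below are stand-ins holding a few names copied from the example outputs in this reference, not the full PMLB lists, and the helper is_union is a hypothetical name used only for illustration. The check compares membership only and makes no assumption about the order of the real variable.

```python
# Stand-in lists: a few names copied from the example outputs in this reference.
classification_names = ["Hill_Valley_with_noise", "adult", "agaricus_lepiota"]
regression_names = ["1027_ESL", "1028_SWD", "1029_LEV"]

# Build the union; sorting only gives this sketch a stable order.
all_names = sorted(set(classification_names) | set(regression_names))


def is_union(combined, *parts):
    """Return True if combined holds exactly the names found in parts."""
    return set(combined) == set().union(*parts)


print(is_union(all_names, classification_names, regression_names))
```

With PMLB installed, the same check could be run on the real variables by passing pmlb.dataset_names, pmlb.classification_dataset_names, and pmlb.regression_dataset_names to is_union.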

Example:

from pmlb import dataset_names

dataset_names[:10]
## ['GAMETES_Epistasis_2_Way_1000atts_0.4H_EDM_1_EDM_1_1', 'GAMETES_Epistasis_2_Way_20atts_0.1H_EDM_1_1', 'GAMETES_Epistasis_2_Way_20atts_0.4H_EDM_1_1', 'GAMETES_Epistasis_3_Way_20atts_0.2H_EDM_1_1', 'GAMETES_Heterogeneity_20atts_1600_Het_0.4_0.2_50_EDM_2_001', 'GAMETES_Heterogeneity_20atts_1600_Het_0.4_0.2_75_EDM_2_001', 'Hill_Valley_with_noise', 'Hill_Valley_without_noise', 'adult', 'agaricus_lepiota']

classification_dataset_names

variable pmlb.classification_dataset_names

A list of all classification datasets included in PMLB.

Classification datasets are datasets where the target value for each data point is discrete (rather than continuous).

Example:

from pmlb import classification_dataset_names

classification_dataset_names[:10]
## ['GAMETES_Epistasis_2_Way_1000atts_0.4H_EDM_1_EDM_1_1', 'GAMETES_Epistasis_2_Way_20atts_0.1H_EDM_1_1', 'GAMETES_Epistasis_2_Way_20atts_0.4H_EDM_1_1', 'GAMETES_Epistasis_3_Way_20atts_0.2H_EDM_1_1', 'GAMETES_Heterogeneity_20atts_1600_Het_0.4_0.2_50_EDM_2_001', 'GAMETES_Heterogeneity_20atts_1600_Het_0.4_0.2_75_EDM_2_001', 'Hill_Valley_with_noise', 'Hill_Valley_without_noise', 'adult', 'agaricus_lepiota']

regression_dataset_names

variable pmlb.regression_dataset_names

A list of all regression datasets included in PMLB.

Regression datasets are datasets where the target value for each data point is continuous (rather than discrete).
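
Because the two lists separate datasets by target type, they can be used to decide which kind of model to apply to a given dataset. The sketch below shows one way to do that lookup. The function task_type is a hypothetical helper, and the short lists are stand-ins built from names shown in the example outputs of this reference.

```python
def task_type(name, classification_names, regression_names):
    """Return 'classification' or 'regression' for a dataset name.

    task_type is a hypothetical helper, not part of the PMLB API.
    """
    if name in classification_names:
        return "classification"
    if name in regression_names:
        return "regression"
    raise ValueError(f"Unknown dataset name: {name!r}")


# Stand-in lists: names copied from the example outputs in this reference.
classification_stand_in = ["adult", "agaricus_lepiota"]
regression_stand_in = ["1027_ESL", "1028_SWD"]

print(task_type("adult", classification_stand_in, regression_stand_in))
print(task_type("1027_ESL", classification_stand_in, regression_stand_in))
```

With PMLB installed, pmlb.classification_dataset_names and pmlb.regression_dataset_names can be passed in place of the stand-in lists.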

Example:

from pmlb import regression_dataset_names

regression_dataset_names[:10]
## ['1027_ESL', '1028_SWD', '1029_LEV', '1030_ERA', '1089_USCrime', '1096_FacultySalaries', '1191_BNG_pbc', '1193_BNG_lowbwt', '1196_BNG_pharynx', '1199_BNG_echoMonths']