PMLB provides the following user-facing functions and variables. For installation instructions or a high-level overview of how to use PMLB, see Install or Using PMLB.

Functions

fetch_data

function pmlb.fetch_data(dataset_name, return_X_y=False, local_cache_dir=None, dropna=True)

Download a dataset from PMLB, (optionally) store it locally, and return the dataset.

You must be connected to the internet if you are fetching a dataset that is not cached locally.

Parameters:

  • dataset_name : str Name of the dataset to fetch. A list of available datasets can be found in the documentation or via pmlb.get_updated_datasets.
  • return_X_y : bool, default=False Specify the format of the data returned to the user. If False, the data are returned as a pandas.DataFrame, of size (n_samples, n_features+1), where the last column is target - the true target value of the data point. If True, the data are returned as a tuple (of length 2), where the first element is a numpy.ndarray of size (n_samples, n_features), and the second element is a numpy.ndarray of size (n_samples,), corresponding to the feature matrix (usually denoted X) and the list of target values (usually denoted y), respectively.
  • local_cache_dir : str, default=None The directory on your local machine in which to store the data files. If None, no local data cache will be used.
  • dropna : bool, default=True If True, pmlb will drop NAs from the exported dataset.

Returns:

  • dataset : pandas.DataFrame or (array-like, array-like) If return_X_y == False, a pandas.DataFrame containing the fetched dataset. If return_X_y == True, a 2-tuple of numpy.ndarrays containing the feature matrix X and the target array y, respectively.

Example:

In [1]: from pmlb import fetch_data

In [2]: X, y = fetch_data('mushroom', return_X_y=True)

In [3]: X
Out[3]: 
array([[2, 0, 7, ..., 1, 4, 6],
       [0, 3, 9, ..., 0, 2, 0],
       [2, 3, 8, ..., 0, 3, 0],
       ...,
       [2, 0, 8, ..., 7, 2, 0],
       [2, 3, 8, ..., 7, 3, 0],
       [3, 2, 0, ..., 7, 4, 6]], dtype=int64)

In [4]: y
Out[4]: array([0, 0, 0, ..., 0, 0, 1], dtype=int64)

get_updated_datasets

function pmlb.get_updated_datasets()

Looks at most recent commit and returns a list of datasets that were updated since the previous commit.

This function is mainly used for generating profiling reports of the datasets.

Returns:

  • changed_datasets : set Set of datasets that have been changed since the previous Git commit.

Example:

In [1]: from pmlb import get_updated_datasets

In [2]: get_updated_datasets()
changed datasets: []
Out[2]: []

Variables

These variables are used to list the datasets that are currently available in PMLB.

dataset_names

variable pmlb.dataset_names

A list of all datasets included in PMLB.

The contents of this variable are equal to the union of pmlb.classification_dataset_names and pmlb.regression_dataset_names.

Example:

In [1]: from pmlb import dataset_names

In [2]: dataset_names
Out[2]: 
['GAMETES_Epistasis_2_Way_1000atts_0.4H_EDM_1_EDM_1_1',
 'GAMETES_Epistasis_2_Way_20atts_0.1H_EDM_1_1',
 'GAMETES_Epistasis_2_Way_20atts_0.4H_EDM_1_1',
 'GAMETES_Epistasis_3_Way_20atts_0.2H_EDM_1_1',
...
 '712_chscase_geyser1']

classification_dataset_names

variable pmlb.classification_dataset_names

A list of all classification datasets included in PMLB.

Classification datasets are datasets where the target value for each data point is discrete (rather than continuous).

Example:

In [1]: from pmlb import classification_dataset_names

In [2]: classification_dataset_names
Out[2]: 
['GAMETES_Epistasis_2_Way_1000atts_0.4H_EDM_1_EDM_1_1',
 'GAMETES_Epistasis_2_Way_20atts_0.1H_EDM_1_1',
 'GAMETES_Epistasis_2_Way_20atts_0.4H_EDM_1_1',
 'GAMETES_Epistasis_3_Way_20atts_0.2H_EDM_1_1',
...
 'yeast']

regression_dataset_names

variable pmlb.regression_dataset_names

A list of all regression datasets included in PMLB.

Regression datasets are datasets where the target value for each data point is continuous (rather than discrete).

Example:

In [1]: from pmlb import regression_dataset_names

In [2]: regression_dataset_names
Out[2]: 
['1027_ESL',
 '1028_SWD',
 '1029_LEV',
 '1030_ERA',
...
 '712_chscase_geyser1']