Python interface

The PMLB Python package provides the following user-facing functions and variables. For installation instructions or a high-level overview of how to use PMLB, see Install or Using PMLB.

Functions

`fetch_data`

function pmlb.fetch_data(dataset_name, return_X_y=False, local_cache_dir=None, dropna=True)

Download a dataset from PMLB, optionally store it locally, and return it.

You must be connected to the internet if you are fetching a dataset that is not cached locally.

Parameters:

dataset_name : str Name of the dataset to fetch. A list of available datasets can be found in the documentation or via pmlb.get_updated_datasets.
return_X_y : bool, default=False Specify the format of the data returned to the user. If False, the data are returned as a pandas.DataFrame of size (n_samples, n_features+1), where the last column, target, holds the true target value of each data point. If True, the data are returned as a tuple of length 2: the first element is a numpy.ndarray of size (n_samples, n_features) holding the feature matrix (usually denoted X), and the second element is a numpy.ndarray of size (n_samples,) holding the target values (usually denoted y).
local_cache_dir : str, default=None The directory on your local machine in which to store the data files. If None, no local data cache will be used.
dropna : bool, default=True If True, pmlb will drop NAs from the exported dataset.

Returns:

dataset : pandas.DataFrame or (array-like, array-like) If return_X_y == False, a pandas.DataFrame containing the fetched dataset. If return_X_y == True, a 2-tuple of numpy.ndarrays containing the feature matrix X and the target array y, respectively.

Example:

from pmlb import fetch_data

X, y = fetch_data('mushroom', return_X_y=True)
X

## array([[2, 0, 7, ..., 1, 4, 6],
##        [0, 3, 9, ..., 0, 2, 0],
##        [2, 3, 8, ..., 0, 3, 0],
##        ...,
##        [2, 0, 8, ..., 7, 2, 0],
##        [2, 3, 8, ..., 7, 3, 0],
##        [3, 2, 0, ..., 7, 4, 6]], shape=(8124, 22))

y

## array([0, 0, 0, ..., 0, 0, 1], shape=(8124,))
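
When the same dataset is fetched repeatedly, passing local_cache_dir avoids downloading it again. The following is a minimal sketch, not part of PMLB: `fetch_cached` and its `fetch` parameter are hypothetical names introduced here, and `fetch` exists only so the helper can be tried with a stand-in function where PMLB or a network connection is unavailable.

```python
from pathlib import Path


def fetch_cached(dataset_name, cache_dir="pmlb_cache", fetch=None):
    """Return (X, y) for a dataset, storing the downloaded file under cache_dir.

    fetch_cached and the fetch parameter are hypothetical helpers for this
    sketch; by default the call is forwarded to pmlb.fetch_data.
    """
    if fetch is None:
        # Imported lazily so this module loads even where PMLB is not installed.
        from pmlb import fetch_data as fetch

    cache_path = Path(cache_dir)
    # Create the directory up front; this is harmless if it already exists.
    cache_path.mkdir(parents=True, exist_ok=True)

    # return_X_y=True yields the (feature matrix, target array) tuple.
    return fetch(dataset_name, return_X_y=True, local_cache_dir=str(cache_path))
```

With PMLB installed, `X, y = fetch_cached('mushroom')` would need an internet connection only while the dataset is not yet in the cache directory.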

Variables

These variables are used to list the datasets that are currently available in PMLB.

`dataset_names`

variable pmlb.dataset_names

A list of all datasets included in PMLB.

The contents of this variable are equal to the union of pmlb.classification_dataset_names and pmlb.regression_dataset_names.

Example:

from pmlb import dataset_names

dataset_names[:10]

## ['1027_ESL', '1028_SWD', '1029_LEV', '1030_ERA', '1089_USCrime', '1096_FacultySalaries', '1191_BNG_pbc', '1193_BNG_lowbwt', '1196_BNG_pharynx', '1199_BNG_echoMonths']
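
A common use of this list is to loop over dataset names and fetch each one. The sketch below records the size of every dataset in a list of names. `summarize_datasets` and its `fetch` parameter are hypothetical names, not part of PMLB; `fetch` is assumed to behave like pmlb.fetch_data called with return_X_y=True.

```python
def summarize_datasets(names, fetch):
    """Map each dataset name to (n_samples, n_features).

    Hypothetical helper: fetch must accept (name, return_X_y=True) and
    return an (X, y) pair, as pmlb.fetch_data is documented to do.
    """
    summary = {}
    for name in names:
        X, _ = fetch(name, return_X_y=True)
        n_samples = len(X)
        # Guard against an empty dataset before inspecting the first row.
        n_features = len(X[0]) if n_samples else 0
        summary[name] = (n_samples, n_features)
    return summary
```

With PMLB installed, `summarize_datasets(dataset_names[:3], fetch_data)` would download three datasets, so it requires an internet connection.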

`classification_dataset_names`

variable pmlb.classification_dataset_names

A list of all classification datasets included in PMLB.

Classification datasets are datasets where the target value for each data point is discrete (rather than continuous).

Example:

from pmlb import classification_dataset_names

classification_dataset_names[:10]

## ['GAMETES_Epistasis_2_Way_1000atts_0.4H_EDM_1_EDM_1_1', 'GAMETES_Epistasis_2_Way_20atts_0.1H_EDM_1_1', 'GAMETES_Epistasis_2_Way_20atts_0.4H_EDM_1_1', 'GAMETES_Epistasis_3_Way_20atts_0.2H_EDM_1_1', 'GAMETES_Heterogeneity_20atts_1600_Het_0.4_0.2_50_EDM_2_001', 'GAMETES_Heterogeneity_20atts_1600_Het_0.4_0.2_75_EDM_2_001', 'Hill_Valley_with_noise', 'Hill_Valley_without_noise', '_deprecated_australian', '_deprecated_auto']
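
Some names in the output above begin with `_deprecated_`. Assuming that prefix marks datasets to skip in new benchmarks (an inference from the names, not something this page states), they can be filtered out with a small helper. `drop_deprecated` is a hypothetical name, not a PMLB function.

```python
def drop_deprecated(names, prefix="_deprecated_"):
    """Return the names that do not start with the given prefix, in order."""
    return [name for name in names if not name.startswith(prefix)]


# Names copied from the example output above.
sample = [
    "Hill_Valley_with_noise",
    "Hill_Valley_without_noise",
    "_deprecated_australian",
    "_deprecated_auto",
]
print(drop_deprecated(sample))
# ['Hill_Valley_with_noise', 'Hill_Valley_without_noise']
```

The same call could be applied to the full list, for example `drop_deprecated(classification_dataset_names)`.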

`regression_dataset_names`

variable pmlb.regression_dataset_names

A list of all regression datasets included in PMLB.

Regression datasets are datasets where the target value for each data point is continuous (rather than discrete).

Example:

from pmlb import regression_dataset_names

regression_dataset_names[:10]

## ['1027_ESL', '1028_SWD', '1029_LEV', '1030_ERA', '1089_USCrime', '1096_FacultySalaries', '1191_BNG_pbc', '1193_BNG_lowbwt', '1196_BNG_pharynx', '1199_BNG_echoMonths']
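
Because dataset_names is the union of the classification and regression lists, the two lists can be used to look up the task type for a given name. A minimal sketch follows; `task_type` is a hypothetical helper, and the lists are passed in explicitly so the function does not depend on PMLB being importable.

```python
def task_type(name, classification_names, regression_names):
    """Return 'classification' or 'regression' for a dataset name.

    Raises ValueError if the name appears in neither list.
    """
    if name in classification_names:
        return "classification"
    if name in regression_names:
        return "regression"
    raise ValueError(f"unknown dataset: {name!r}")
```

With PMLB installed, the call would take the form `task_type(name, classification_dataset_names, regression_dataset_names)`.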