PMLB provides the following user-facing functions and variables. For installation instructions or a high-level overview of how to use PMLB, see Install or Using PMLB.
fetch_data
function pmlb.fetch_data(dataset_name, return_X_y=False, local_cache_dir=None, dropna=True)
Download a dataset from PMLB, (optionally) store it locally, and return the dataset.
You must be connected to the internet if you are fetching a dataset that is not cached locally.
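The caching behaviour described above can be sketched as a small wrapper. This is an illustrative sketch rather than part of PMLB: the function name fetch_with_cache, the default directory "pmlb_cache", and the fetch parameter are all assumptions made here. Only pmlb.fetch_data and its return_X_y and local_cache_dir arguments come from the signature above.

```python
from pathlib import Path


def fetch_with_cache(dataset_name, cache_dir="pmlb_cache", fetch=None):
    """Fetch a dataset as (X, y), keeping a local copy under cache_dir.

    fetch_with_cache, cache_dir's default, and the fetch parameter are
    hypothetical names for this sketch. fetch defaults to pmlb.fetch_data;
    it is a parameter only so the wrapper can be tried without network access.
    """
    if fetch is None:
        # Imported lazily so the sketch can be loaded without PMLB installed.
        from pmlb import fetch_data as fetch

    # Make sure the cache directory exists before handing it to the fetcher.
    Path(cache_dir).mkdir(parents=True, exist_ok=True)

    # Ask for the (X, y) tuple form and point the fetcher at the local cache.
    return fetch(dataset_name, return_X_y=True, local_cache_dir=str(cache_dir))
```

With PMLB installed, X, y = fetch_with_cache('mushroom') would be expected to need an internet connection the first time, and to be served from the local cache on later calls.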
Parameters:
dataset_name: The name of the dataset to fetch. (… pmlb.get_updated_datasets.)
return_X_y (default: False): If False, the data are returned as a pandas.DataFrame of size (n_samples, n_features+1), where the last column, target, holds the true target value of each data point. If True, the data are returned as a tuple of length 2, where the first element is a numpy.ndarray of size (n_samples, n_features) and the second element is a numpy.ndarray of size (n_samples,), corresponding to the feature matrix (usually denoted X) and the array of target values (usually denoted y), respectively.
local_cache_dir (default: None): The local directory in which to store the data. If None, no local data cache will be used.
dropna (default: True): If True, pmlb will drop NAs from the exported dataset.
Returns:
If return_X_y == False, a pandas.DataFrame containing the fetched dataset. If return_X_y == True, a 2-tuple of numpy.ndarrays containing the feature matrix X and the target array y, respectively.
Example:
In [1]: from pmlb import fetch_data
In [2]: X, y = fetch_data('mushroom', return_X_y=True)
In [3]: X
Out[3]:
array([[2, 0, 7, ..., 1, 4, 6],
       [0, 3, 9, ..., 0, 2, 0],
       [2, 3, 8, ..., 0, 3, 0],
       ...,
       [2, 0, 8, ..., 7, 2, 0],
       [2, 3, 8, ..., 7, 3, 0],
       [3, 2, 0, ..., 7, 4, 6]], dtype=int64)
In [4]: y
Out[4]: array([0, 0, 0, ..., 0, 0, 1], dtype=int64)
get_updated_datasets
function pmlb.get_updated_datasets()
Looks at the most recent commit and returns a list of datasets that were updated since the previous commit.
This function is mainly used for generating profiling reports of the datasets.
Returns:
Example:
These variables are used to list the datasets that are currently available in PMLB.
dataset_names
variable pmlb.dataset_names
A list of all datasets included in PMLB.
The contents of this variable are equal to the union of pmlb.classification_dataset_names and pmlb.regression_dataset_names.
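The union relationship stated above can be checked with a few lines of Python. This is a minimal sketch: the helper is_union is a hypothetical name, and the short lists are stand-in values so the snippet runs without PMLB installed. The regression entry in particular is a placeholder, not a real PMLB dataset name.

```python
def is_union(all_names, classification_names, regression_names):
    """Return True if all_names holds exactly the names in the other two lists."""
    return set(all_names) == set(classification_names) | set(regression_names)


# Stand-in values for illustration only. With PMLB installed, pass
# pmlb.dataset_names, pmlb.classification_dataset_names and
# pmlb.regression_dataset_names instead.
classification_names = ["GAMETES_Epistasis_2_Way_20atts_0.1H_EDM_1_1", "yeast"]
regression_names = ["hypothetical_regression_dataset"]  # placeholder name
all_names = classification_names + regression_names

print(is_union(all_names, classification_names, regression_names))
```

Comparing sets rather than lists ignores ordering, so the check does not depend on how the combined list happens to be ordered.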
Example:
classification_dataset_names
variable pmlb.classification_dataset_names
A list of all classification datasets included in PMLB.
Classification datasets are datasets where the target value for each data point is discrete (rather than continuous).
Example:
In [1]: from pmlb import classification_dataset_names
In [2]: classification_dataset_names
Out[2]:
['GAMETES_Epistasis_2_Way_1000atts_0.4H_EDM_1_EDM_1_1',
'GAMETES_Epistasis_2_Way_20atts_0.1H_EDM_1_1',
'GAMETES_Epistasis_2_Way_20atts_0.4H_EDM_1_1',
'GAMETES_Epistasis_3_Way_20atts_0.2H_EDM_1_1',
...
'yeast']
regression_dataset_names
variable pmlb.regression_dataset_names
A list of all regression datasets included in PMLB.
Regression datasets are datasets where the target value for each data point is continuous (rather than discrete).
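Because each dataset is listed as either classification or regression, the two lists can be used to look up which kind of task a dataset represents. The sketch below assumes a hypothetical helper, task_type, and uses stand-in lists in place of the real PMLB variables. The regression name is a placeholder.

```python
def task_type(name, classification_names, regression_names):
    """Return 'classification' or 'regression' for a dataset name.

    task_type is a hypothetical helper for this sketch, not a PMLB function.
    """
    if name in classification_names:
        return "classification"
    if name in regression_names:
        return "regression"
    raise KeyError(f"{name!r} is not in either list")


# Stand-in values for illustration only. With PMLB installed, pass
# pmlb.classification_dataset_names and pmlb.regression_dataset_names.
example_classification = ["yeast"]
example_regression = ["hypothetical_regression_dataset"]  # placeholder name

print(task_type("yeast", example_classification, example_regression))
```

A lookup like this can help choose between a classifier and a regressor before fetching a dataset.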
Example: