PMLB provides the following user-facing functions and variables. For installation instructions or a high-level overview of how to use PMLB, see Install or Using PMLB.
fetch_data
function pmlb.fetch_data(dataset_name, return_X_y=False, local_cache_dir=None, dropna=True)
Download a dataset from PMLB, (optionally) store it locally, and return the dataset.
You must be connected to the internet if you are fetching a dataset that is not cached locally.
Parameters:
dataset_name: The name of the dataset to fetch.
return_X_y: If False, the data are returned as a pandas.DataFrame of size (n_samples, n_features+1), where the last column, target, holds the true target value of each data point. If True, the data are returned as a tuple of length 2, where the first element is a numpy.ndarray of size (n_samples, n_features) and the second element is a numpy.ndarray of size (n_samples,), corresponding to the feature matrix (usually denoted X) and the list of target values (usually denoted y), respectively.
local_cache_dir: The local directory in which to store the downloaded data. If None, no local data cache will be used.
dropna: If True, pmlb will drop NAs from the exported dataset.
Returns:
If return_X_y == False, a pandas.DataFrame containing the fetched dataset. If return_X_y == True, a 2-tuple of numpy.ndarrays containing the feature matrix X and the target array y, respectively.
Example:
In [1]: from pmlb import fetch_data
In [2]: X, y = fetch_data('mushroom', return_X_y=True)
In [3]: X
Out[3]:
array([[2, 0, 7, ..., 1, 4, 6],
       [0, 3, 9, ..., 0, 2, 0],
       [2, 3, 8, ..., 0, 3, 0],
       ...,
       [2, 0, 8, ..., 7, 2, 0],
       [2, 3, 8, ..., 7, 3, 0],
       [3, 2, 0, ..., 7, 4, 6]], dtype=int64)
In [4]: y
Out[4]: array([0, 0, 0, ..., 0, 0, 1], dtype=int64)
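The local_cache_dir parameter can be combined with return_X_y so that repeated runs do not need to download the data again. The following is a minimal sketch, not part of PMLB: the helper name fetch_with_cache and its fetch argument are hypothetical, and fetch exists only so the helper can be exercised offline with a stand-in for pmlb.fetch_data.

```python
import os


def fetch_with_cache(dataset_name, cache_dir, fetch=None):
    """Fetch a dataset as (X, y), storing its files under cache_dir.

    `fetch` defaults to pmlb.fetch_data. It is a parameter only so that
    this hypothetical helper can be tried offline with a stand-in function.
    """
    if fetch is None:
        # Imported lazily so the helper can be defined without pmlb installed.
        from pmlb import fetch_data as fetch

    # Creating the directory up front is harmless if it already exists.
    os.makedirs(cache_dir, exist_ok=True)

    # return_X_y=True asks for the (feature matrix, target array) tuple;
    # local_cache_dir tells fetch_data where to keep the local copy.
    X, y = fetch(dataset_name, return_X_y=True, local_cache_dir=cache_dir)
    return X, y
```

Calling fetch_with_cache('mushroom', './pmlb_cache') should need an internet connection only when the dataset is not already cached in that directory; the directory name here is an arbitrary choice.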
get_updated_datasets
function pmlb.get_updated_datasets()
Looks at the most recent commit and returns a list of the datasets that were updated since the previous commit.
This function is mainly used for generating profiling reports of the datasets.
Returns:
Example:
The following variables list the datasets that are currently available in PMLB.
dataset_names
variable pmlb.dataset_names
A list of all datasets included in PMLB.
The contents of this variable are equal to the union of pmlb.classification_dataset_names and pmlb.regression_dataset_names.
Example:
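As a hedged illustration of the union relationship stated above, the sketch below compares the three lists as sets. The function names is_union_of_tasks and check_pmlb_names are hypothetical and not part of PMLB; only the three pmlb variables come from this reference.

```python
def is_union_of_tasks(all_names, classification_names, regression_names):
    """Return True if all_names holds exactly the names in the two task lists."""
    # Sets ignore ordering and duplicates, so only membership is compared.
    return set(all_names) == set(classification_names) | set(regression_names)


def check_pmlb_names():
    """Run the check against the installed pmlb package (requires pmlb)."""
    from pmlb import (
        classification_dataset_names,
        dataset_names,
        regression_dataset_names,
    )

    return is_union_of_tasks(
        dataset_names, classification_dataset_names, regression_dataset_names
    )
```

If the documented relationship holds for the installed version, check_pmlb_names() should return True.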
classification_dataset_names
variable pmlb.classification_dataset_names
A list of all classification datasets included in PMLB.
Classification datasets are datasets where the target value for each data point is discrete (rather than continuous).
Example:
In [1]: from pmlb import classification_dataset_names
In [2]: classification_dataset_names
Out[2]:
['GAMETES_Epistasis_2_Way_1000atts_0.4H_EDM_1_EDM_1_1',
 'GAMETES_Epistasis_2_Way_20atts_0.1H_EDM_1_1',
 'GAMETES_Epistasis_2_Way_20atts_0.4H_EDM_1_1',
 'GAMETES_Epistasis_3_Way_20atts_0.2H_EDM_1_1',
 ...
 'yeast']
regression_dataset_names
variable pmlb.regression_dataset_names
A list of all regression datasets included in PMLB.
Regression datasets are datasets where the target value for each data point is continuous (rather than discrete).
Example:
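One way to use the two task-specific lists together is to look up which kind of problem a given dataset poses. This is only a sketch: the helper task_type is hypothetical, not part of PMLB, and it assumes a dataset name appears in at most one of the two lists (the classification list is checked first).

```python
def task_type(dataset_name, classification_names, regression_names):
    """Return 'classification' or 'regression' for a dataset name.

    Raises ValueError if the name is in neither list.
    """
    if dataset_name in classification_names:
        return "classification"  # discrete target values
    if dataset_name in regression_names:
        return "regression"  # continuous target values
    raise ValueError(f"unknown dataset: {dataset_name!r}")
```

With pmlb installed, the lists can be passed in directly, for example task_type(name, pmlb.classification_dataset_names, pmlb.regression_dataset_names) for any name taken from pmlb.dataset_names.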