The PMLB Python package provides the following user-facing functions and variables. For installation instructions or a high-level overview of how to use PMLB, see Install or Using PMLB.
fetch_data
function pmlb.fetch_data(dataset_name, return_X_y=False, local_cache_dir=None, dropna=True)
Download a dataset from PMLB, (optionally) store it locally, and return the dataset.
You must be connected to the internet if you are fetching a dataset that is not cached locally.
Parameters:
dataset_name: the name of the dataset to fetch. See pmlb.get_updated_datasets.
return_X_y: if False, the data are returned as a pandas.DataFrame of size (n_samples, n_features+1), where the last column is target, the true target value of the data point. If True, the data are returned as a tuple (of length 2), where the first element is a numpy.ndarray of size (n_samples, n_features) and the second element is a numpy.ndarray of size (n_samples,), corresponding to the feature matrix (usually denoted X) and the array of target values (usually denoted y), respectively.
local_cache_dir: the local directory in which to store fetched datasets. If None, no local data cache will be used.
dropna: if True, pmlb will drop NAs from the exported dataset.
Returns:
If return_X_y == False, a pandas.DataFrame containing the fetched dataset. If return_X_y == True, a 2-tuple of numpy.ndarrays containing the feature matrix X and the target array y, respectively.
Example:
from pmlb import fetch_data
X, y = fetch_data('mushroom', return_X_y=True)
X
## array([[2, 0, 7, ..., 1, 4, 6],
## [0, 3, 9, ..., 0, 2, 0],
## [2, 3, 8, ..., 0, 3, 0],
## ...,
## [2, 0, 8, ..., 7, 2, 0],
## [2, 3, 8, ..., 7, 3, 0],
## [3, 2, 0, ..., 7, 4, 6]])
y
## array([0, 0, 0, ..., 0, 0, 1])
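The two return formats and the dropna option can be illustrated without a download. The following is a minimal sketch, not PMLB's own implementation: it uses a small hand-made pandas.DataFrame as a stand-in for what fetch_data returns with return_X_y=False (the feature names and values are made up), drops rows with missing values as dropna=True is described to do, and then splits the frame into the (X, y) pair that return_X_y=True is described to return.

```python
import numpy as np
import pandas as pd

# Stand-in for a fetched dataset: feature columns followed by a final
# "target" column. Feature names and values are invented for illustration.
df = pd.DataFrame({
    "feature_a": [2.0, 0.0, np.nan, 3.0],
    "feature_b": [0.0, 3.0, 3.0, 2.0],
    "target": [0, 0, 0, 1],
})

# Assumed equivalent of dropna=True: discard any row containing an NA.
clean = df.dropna()

# Assumed equivalent of return_X_y=True: separate features from the target.
X = clean.drop(columns="target").to_numpy()  # shape (n_samples, n_features)
y = clean["target"].to_numpy()               # shape (n_samples,)

print(X.shape)  # (3, 2)
print(y)        # [0 0 1]
```

The DataFrame form is convenient for inspection by column name, while the (X, y) form can be passed directly to estimators that expect a feature matrix and a target array.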
The following variables list the datasets that are currently available in PMLB.
dataset_names
variable pmlb.dataset_names
A list of all datasets included in PMLB.
The contents of this variable are equal to the union of
pmlb.classification_dataset_names
and
pmlb.regression_dataset_names
.
Example:
from pmlb import dataset_names
dataset_names[:10]
## ['GAMETES_Epistasis_2_Way_1000atts_0.4H_EDM_1_EDM_1_1', 'GAMETES_Epistasis_2_Way_20atts_0.1H_EDM_1_1', 'GAMETES_Epistasis_2_Way_20atts_0.4H_EDM_1_1', 'GAMETES_Epistasis_3_Way_20atts_0.2H_EDM_1_1', 'GAMETES_Heterogeneity_20atts_1600_Het_0.4_0.2_50_EDM_2_001', 'GAMETES_Heterogeneity_20atts_1600_Het_0.4_0.2_75_EDM_2_001', 'Hill_Valley_with_noise', 'Hill_Valley_without_noise', 'adult', 'agaricus_lepiota']
classification_dataset_names
variable pmlb.classification_dataset_names
A list of all classification datasets included in PMLB.
Classification datasets are datasets where the target value for each data point is discrete (rather than continuous).
Example:
from pmlb import classification_dataset_names
classification_dataset_names[:10]
## ['GAMETES_Epistasis_2_Way_1000atts_0.4H_EDM_1_EDM_1_1', 'GAMETES_Epistasis_2_Way_20atts_0.1H_EDM_1_1', 'GAMETES_Epistasis_2_Way_20atts_0.4H_EDM_1_1', 'GAMETES_Epistasis_3_Way_20atts_0.2H_EDM_1_1', 'GAMETES_Heterogeneity_20atts_1600_Het_0.4_0.2_50_EDM_2_001', 'GAMETES_Heterogeneity_20atts_1600_Het_0.4_0.2_75_EDM_2_001', 'Hill_Valley_with_noise', 'Hill_Valley_without_noise', 'adult', 'agaricus_lepiota']
regression_dataset_names
variable pmlb.regression_dataset_names
A list of all regression datasets included in PMLB.
Regression datasets are datasets where the target value for each data point is continuous (rather than discrete).
Example:
from pmlb import regression_dataset_names
regression_dataset_names[:10]
## ['1027_ESL', '1028_SWD', '1029_LEV', '1030_ERA', '1089_USCrime', '1096_FacultySalaries', '1191_BNG_pbc', '1193_BNG_lowbwt', '1196_BNG_pharynx', '1199_BNG_echoMonths']
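Because dataset_names is described as the union of the classification and regression lists, a dataset's task type can be found by checking which list contains its name. The sketch below uses short stand-in lists so that it runs without PMLB installed; task_type is a hypothetical helper, not part of the pmlb package, and it assumes that no name appears in both lists.

```python
# Stand-in lists for illustration; in practice, import the real ones:
#   from pmlb import classification_dataset_names, regression_dataset_names
classification_dataset_names = ["adult", "agaricus_lepiota"]
regression_dataset_names = ["1027_ESL", "1028_SWD"]

# dataset_names is described as the union of the two lists above.
dataset_names = classification_dataset_names + regression_dataset_names


def task_type(name):
    """Return the task type of a dataset name (hypothetical helper)."""
    if name in classification_dataset_names:
        return "classification"
    if name in regression_dataset_names:
        return "regression"
    raise ValueError(f"unknown dataset name: {name!r}")


print(task_type("adult"))     # classification
print(task_type("1027_ESL"))  # regression
```

Raising an error for unknown names, rather than returning None, makes a misspelled dataset name fail early instead of at download time.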