The Python package PMLB provides the following user-facing functions and variables. For installation instructions or a high-level overview of how to use PMLB, see Install or Using PMLB.
fetch_data (function)
pmlb.fetch_data(dataset_name, return_X_y=False, local_cache_dir=None, dropna=True)
Download a dataset from PMLB, (optionally) store it locally, and return the dataset.
You must be connected to the internet if you are fetching a dataset that is not cached locally.
Parameters:
dataset_name: The name of the dataset to fetch. [...] pmlb.get_updated_datasets.
return_X_y (default: False): If False, the data are returned as a pandas.DataFrame of size (n_samples, n_features+1), where the last column, target, holds the true target value of each data point. If True, the data are returned as a tuple of length 2, where the first element is a numpy.ndarray of size (n_samples, n_features) and the second element is a numpy.ndarray of size (n_samples,), corresponding to the feature matrix (usually denoted X) and the array of target values (usually denoted y), respectively.
local_cache_dir (default: None): If None, no local data cache will be used.
dropna (default: True): If True, pmlb will drop NAs from the exported dataset.
Returns:
If return_X_y == False, a pandas.DataFrame containing the fetched dataset. If return_X_y == True, a 2-tuple of numpy.ndarrays containing the feature matrix X and the target array y, respectively.
Example:
## array([[2, 0, 7, ..., 1, 4, 6],
## [0, 3, 9, ..., 0, 2, 0],
## [2, 3, 8, ..., 0, 3, 0],
## ...,
## [2, 0, 8, ..., 7, 2, 0],
## [2, 3, 8, ..., 7, 3, 0],
## [3, 2, 0, ..., 7, 4, 6]], shape=(8124, 22))
## array([0, 0, 0, ..., 0, 0, 1], shape=(8124,))
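The call that produced the output above is not shown. As a hedged sketch of how fetch_data might be used with a local cache, the helper below wraps it; `load_xy`, the `pmlb_cache` directory, and the `fetch` parameter are all hypothetical names introduced for illustration, and only the keyword arguments from the signature above are passed through.

```python
from pathlib import Path


def load_xy(dataset_name, cache_dir="pmlb_cache", fetch=None):
    """Fetch a PMLB dataset as (X, y), caching the file under cache_dir."""
    if fetch is None:
        # Imported lazily so the helper can be tried with a stand-in fetcher.
        from pmlb import fetch_data as fetch
    cache_dir = Path(cache_dir)
    # Creating the directory up front is a precaution, not a documented requirement.
    cache_dir.mkdir(parents=True, exist_ok=True)
    # return_X_y=True asks for the (feature matrix, target array) tuple.
    X, y = fetch(dataset_name, return_X_y=True, local_cache_dir=str(cache_dir))
    if len(X) != len(y):
        raise ValueError("X and y disagree on the number of samples")
    return X, y
```

With pmlb installed and a network connection, `X, y = load_xy("mushroom")` would download the data on the first call (the dataset name `mushroom` is an assumption); later calls should be able to read from `pmlb_cache` without network access.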
The following variables list the datasets that are currently available in PMLB.
dataset_names (variable)
pmlb.dataset_names
A list of all datasets included in PMLB.
The contents of this variable are equal to the union of
pmlb.classification_dataset_names and
pmlb.regression_dataset_names.
Example:
## ['1027_ESL', '1028_SWD', '1029_LEV', '1030_ERA', '1089_USCrime', '1096_FacultySalaries', '1191_BNG_pbc', '1193_BNG_lowbwt', '1196_BNG_pharynx', '1199_BNG_echoMonths']
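The union relationship described above can be checked directly. This is a minimal sketch: `is_union_of` is a hypothetical helper, not part of pmlb, and it compares the lists as sets, which assumes that ordering and duplicates do not matter for the comparison.

```python
def is_union_of(all_names, *subsets):
    """Return True if all_names has exactly the members of the union of subsets."""
    combined = set().union(*subsets)
    return set(all_names) == combined


# With pmlb installed, the three documented lists would be passed in like so:
#   import pmlb
#   is_union_of(pmlb.dataset_names,
#               pmlb.classification_dataset_names,
#               pmlb.regression_dataset_names)
```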
classification_dataset_names (variable)
pmlb.classification_dataset_names
A list of all classification datasets included in PMLB.
Classification datasets are datasets where the target value for each data point is discrete (rather than continuous).
Example:
## ['GAMETES_Epistasis_2_Way_1000atts_0.4H_EDM_1_EDM_1_1', 'GAMETES_Epistasis_2_Way_20atts_0.1H_EDM_1_1', 'GAMETES_Epistasis_2_Way_20atts_0.4H_EDM_1_1', 'GAMETES_Epistasis_3_Way_20atts_0.2H_EDM_1_1', 'GAMETES_Heterogeneity_20atts_1600_Het_0.4_0.2_50_EDM_2_001', 'GAMETES_Heterogeneity_20atts_1600_Het_0.4_0.2_75_EDM_2_001', 'Hill_Valley_with_noise', 'Hill_Valley_without_noise', '_deprecated_australian', '_deprecated_auto']
regression_dataset_names (variable)
pmlb.regression_dataset_names
A list of all regression datasets included in PMLB.
Regression datasets are datasets where the target value for each data point is continuous (rather than discrete).
Example:
## ['1027_ESL', '1028_SWD', '1029_LEV', '1030_ERA', '1089_USCrime', '1096_FacultySalaries', '1191_BNG_pbc', '1193_BNG_lowbwt', '1196_BNG_pharynx', '1199_BNG_echoMonths']
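Because the two lists split datasets by target type, they can be used to decide which kind of model to fit to a given dataset. The sketch below assumes a name appears in at most one of the two lists; `task_type` is a hypothetical helper, and the lists are passed in as arguments so it can be tried without pmlb installed.

```python
def task_type(dataset_name, classification_names, regression_names):
    """Return 'classification' or 'regression' for a PMLB dataset name."""
    if dataset_name in classification_names:
        return "classification"
    if dataset_name in regression_names:
        return "regression"
    raise KeyError(f"{dataset_name!r} is not in either list")


# Short stand-in lists using names from the example outputs above.
classification = ["Hill_Valley_with_noise", "Hill_Valley_without_noise"]
regression = ["1027_ESL", "1028_SWD"]
print(task_type("1027_ESL", classification, regression))
```

In real use, pmlb.classification_dataset_names and pmlb.regression_dataset_names would take the place of the two stand-in lists.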