pmlbr provides the following user-facing functions and variables in the R environment. For installation instructions or a high-level overview of how to use PMLB, please see R interface.

Functions

fetch_data

function fetch_data(dataset_name, return_X_y = FALSE, local_cache_dir = NA, dropna = TRUE)

Download a dataset from PMLB, (optionally) store it locally, and return the dataset.

You must be connected to the internet if you are fetching a dataset that is not cached locally.

Parameters:

  • dataset_name : str Name of the dataset to fetch from PMLB. A list of available datasets can be found in the documentation.
  • return_X_y : bool, default=FALSE Specify the format of the data returned to the user. If FALSE, the data are returned as a dataframe of size (n_samples, n_features+1), where the last column is target - the true target value of the data point. If TRUE, the data are returned as a list of length 2, where the first element is a dataframe of size (n_samples, n_features) corresponding to the feature matrix (usually denoted X), and the second element is a numeric vector of length n_samples corresponding to the list of target values (usually denoted y), respectively.
  • local_cache_dir : str, default=NA The directory on your local machine in which to store the data files. The default, NA, means no cache is used.
  • dropna : bool, default=TRUE If TRUE, NAs will be automatically dropped from the exported dataset.

Returns:

  • dataset : dataframe If return_X_y == FALSE, fetched dataset as a dataframe of size (n_samples, n_features+1), where the last column is target. If return_X_y == TRUE, a list of length 2 containing the feature matrix X and the target array y, respectively.

Example:

library(pmlbr)

# Download features and labels for penguins dataset in single data frame
penguins <- fetch_data('penguins')
str(penguins)
## 'data.frame':    333 obs. of  8 variables:
##  $ island           : int  2 2 2 2 2 2 2 2 2 2 ...
##  $ bill_length_mm   : num  39.1 39.5 40.3 36.7 39.3 38.9 39.2 41.1 38.6 34.6 ...
##  $ bill_depth_mm    : num  18.7 17.4 18 19.3 20.6 17.8 19.6 17.6 21.2 21.1 ...
##  $ flipper_length_mm: int  181 186 195 193 190 181 195 182 191 198 ...
##  $ body_mass_g      : int  3750 3800 3250 3450 3650 3625 4675 3200 3800 4400 ...
##  $ sex              : int  1 0 0 0 1 0 1 0 1 1 ...
##  $ year             : int  2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
##  $ target           : int  0 0 0 0 0 0 0 0 0 0 ...
##  - attr(*, "na.action")= 'omit' Named int [1:11] 4 9 10 11 12 48 179 219 257 269 ...
##   ..- attr(*, "names")= chr [1:11] "4" "9" "10" "11" ...
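
The other documented options can be combined in one call. The following is a minimal sketch, assuming an internet connection on the first run; the cache location under tempdir() and the names cache_dir, X, and y are arbitrary choices for illustration, and the list elements are accessed by position because the Returns section only describes their order.

```r
library(pmlbr)

# Cache directory chosen for this sketch only; any writable path should work.
cache_dir <- file.path(tempdir(), "pmlb_cache")
dir.create(cache_dir, showWarnings = FALSE, recursive = TRUE)

# Request features and target separately; the result is a list of length 2.
penguins_xy <- fetch_data("penguins", return_X_y = TRUE, local_cache_dir = cache_dir)
X <- penguins_xy[[1]]  # feature data frame (first element)
y <- penguins_xy[[2]]  # target values (second element)

# Both pieces should describe the same observations.
stopifnot(length(penguins_xy) == 2, NROW(X) == NROW(y))
```

A later call with the same local_cache_dir should be able to read the stored file instead of downloading it again.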

nearest_datasets

function

nearest_datasets(
  x,
  y = NULL,
  n_neighbors = 5,
  dimensions = c("n_instances", "n_features"),
  task = c("classification", "regression"),
  target_name = "target",
  ...
)

Select nearest datasets given dataset name or dataframe x.

Parameters:

  • x : str or dataframe
    Character string of dataset name from PMLB, or data.frame of n_samples x n_features (or n_features+1 with a target column)

  • y : numeric vector Target column. Required when x does not contain the target column.

  • n_neighbors : integer Number of dataset names to return as neighbors.

  • dimensions : character vector Dataset characteristics to include in similarity calculation. Dimensions must correspond to numeric columns of all_summary_stats.tsv. Defaults to c("n_instances", "n_features"), as shown in the signature above. If ‘all’, uses all numeric columns.

  • target_name : str Character string specifying column of target/dependent variable.

  • task : str Character string specifying classification or regression for summary stat generation.

  • ... : Further arguments passed to each S3 method.

Returns:

  • dataset : character vector Names of most similar datasets to x, most similar dataset first.

Examples:

nearest_datasets('penguins')
## [1] "penguins"      "ecoli"         "schizo"        "bupa"         
## [5] "solar_flare_1"
nearest_datasets(fetch_data('penguins'))
## [1] "penguins"      "ecoli"         "schizo"        "bupa"         
## [5] "solar_flare_1"
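
The optional arguments can narrow the search. This sketch uses only the parameters documented above; the variable name neighbors is illustrative, and the lookup in summary_stats assumes that object is available once pmlbr is attached, as described under Data objects.

```r
library(pmlbr)

# Ask for the three closest datasets, compared on size and feature count.
neighbors <- nearest_datasets(
  "penguins",
  n_neighbors = 3,
  dimensions = c("n_instances", "n_features")
)

# The result holds dataset names, so it can be used to look up their statistics.
summary_stats[summary_stats$dataset %in% neighbors,
              c("dataset", "n_instances", "n_features", "task")]
```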

Data objects

These objects contain details on datasets currently available in PMLB.

dataset_names

A character vector of the names of all datasets included in PMLB.

The contents of this vector are the union of classification_dataset_names and regression_dataset_names.

Example:

head(dataset_names, 9)
## [1] "1027_ESL"             "1028_SWD"             "1029_LEV"            
## [4] "1030_ERA"             "1089_USCrime"         "1096_FacultySalaries"
## [7] "1191_BNG_pbc"         "1193_BNG_lowbwt"      "1196_BNG_pharynx"
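
The relationship between the three name vectors can be checked directly. This is a small sketch using base R set functions; the variable name combined is illustrative, and no output is shown because it depends on the installed version of the data.

```r
library(pmlbr)

# dataset_names is described as the union of the two task-specific vectors.
combined <- union(classification_dataset_names, regression_dataset_names)
same_names <- setequal(dataset_names, combined)
same_names

# Check which task a given dataset name belongs to.
"penguins" %in% classification_dataset_names
"penguins" %in% regression_dataset_names
```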

classification_dataset_names

A character vector of the names of all classification datasets included in PMLB.

Classification datasets are datasets where the target values (dependent variable/outcome) are discrete.

Example:

head(classification_dataset_names, 9)
## [1] "adult"                  "agaricus_lepiota"       "allbp"                 
## [4] "allhyper"               "allhypo"                "allrep"                
## [7] "analcatdata_aids"       "analcatdata_asbestos"   "analcatdata_authorship"

regression_dataset_names

A character vector of the names of all regression datasets included in PMLB.

Regression datasets are datasets where the target values (dependent variable/outcome) are continuous.

Example:

head(regression_dataset_names, 9)
## [1] "1027_ESL"             "1028_SWD"             "1029_LEV"            
## [4] "1030_ERA"             "1089_USCrime"         "1096_FacultySalaries"
## [7] "1191_BNG_pbc"         "1193_BNG_lowbwt"      "1196_BNG_pharynx"

summary_stats

Summary statistics for all datasets.

A data frame with 10 variables:

  • dataset: Dataset name
  • n_instances: Number of data observations (equal to number of rows)
  • n_features: Total number of features (number of columns - 1)
  • n_binary_features: Number of binary features
  • n_categorical_features: Number of categorical features
  • n_continuous_features: Number of continuous features
  • n_classes: Number of classes in target variable
  • endpoint_type: Value type of endpoint/target (can be binary, categorical or continuous)
  • imbalance: Imbalance measure, where zero means that the dataset is perfectly balanced and the higher the value, the more imbalanced the dataset
  • task: Type of problem/task. Can be classification or regression.

Example:

head(summary_stats)
##                dataset n_instances n_features n_binary_features
## 1             1027_ESL         488          4                 0
## 2             1028_SWD        1000         10                 0
## 3             1029_LEV        1000          4                 0
## 4             1030_ERA        1000          4                 0
## 5         1089_USCrime          47         13                 0
## 6 1096_FacultySalaries          50          4                 0
##   n_categorical_features n_continuous_features endpoint_type n_classes
## 1                      0                     4    continuous         9
## 2                      0                    10    continuous         4
## 3                      0                     4    continuous         5
## 4                      0                     4    continuous         9
## 5                      0                    13    continuous        42
## 6                      0                     4    continuous        39
##     imbalance       task
## 1 0.099363200 regression
## 2 0.108290667 regression
## 3 0.111245000 regression
## 4 0.031251250 regression
## 5 0.002970111 regression
## 6 0.004063158 regression
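
Because summary_stats is an ordinary data frame, it can be filtered to shortlist datasets by task and size. The sketch below is self-contained: stats is a stand-in built from the six rows printed above (four of the ten columns), and small_datasets is a hypothetical helper, not part of pmlbr.

```r
# Stand-in for summary_stats, copied from the rows printed above.
stats <- data.frame(
  dataset     = c("1027_ESL", "1028_SWD", "1029_LEV",
                  "1030_ERA", "1089_USCrime", "1096_FacultySalaries"),
  n_instances = c(488, 1000, 1000, 1000, 47, 50),
  n_features  = c(4, 10, 4, 4, 13, 4),
  task        = "regression",
  stringsAsFactors = FALSE
)

# Hypothetical helper: names of datasets for one task with at most
# max_instances observations.
small_datasets <- function(stats, task_name, max_instances) {
  keep <- stats$task == task_name & stats$n_instances <= max_instances
  stats$dataset[keep]
}

small_datasets(stats, "regression", 100)
```

On the stand-in data this returns the two smallest datasets, 1089_USCrime and 1096_FacultySalaries. With pmlbr attached, passing summary_stats in place of stats should work the same way, since the helper relies only on the documented dataset, n_instances, and task columns.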