Using pmlbr

Install

The pmlbr package provides user-facing functions and variables for working with PMLB datasets in R; they are described under Usage below.

The package works for any recent version of R.

You can install the released version of pmlbr from CRAN with:

install.packages('pmlbr')

Or the development version from GitHub with remotes:

# install.packages('remotes') # uncomment to install remotes
library(remotes)
remotes::install_github("EpistasisLab/pmlbr")

Usage

Fetch data

The core function of this package is fetch_data, which downloads data from the PMLB repository. For example:

library(pmlbr)

# Download features and labels for the penguins dataset in a single data frame
penguins <- fetch_data('penguins')

## Download successful.

str(penguins)

## 'data.frame':    333 obs. of  8 variables:
##  $ island           : int  2 2 2 2 2 2 2 2 2 2 ...
##  $ bill_length_mm   : num  39.1 39.5 40.3 36.7 39.3 38.9 39.2 41.1 38.6 34.6 ...
##  $ bill_depth_mm    : num  18.7 17.4 18 19.3 20.6 17.8 19.6 17.6 21.2 21.1 ...
##  $ flipper_length_mm: int  181 186 195 193 190 181 195 182 191 198 ...
##  $ body_mass_g      : int  3750 3800 3250 3450 3650 3625 4675 3200 3800 4400 ...
##  $ sex              : int  1 0 0 0 1 0 1 0 1 1 ...
##  $ year             : int  2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
##  $ target           : int  0 0 0 0 0 0 0 0 0 0 ...
##  - attr(*, "na.action")= 'omit' Named int [1:11] 4 9 10 11 12 48 179 219 257 269 ...
##   ..- attr(*, "names")= chr [1:11] "4" "9" "10" "11" ...

# Download features and labels for the penguins dataset in separate data structures
penguins <- fetch_data('penguins', return_X_y=TRUE)

## Download successful.

head(penguins$x) # feature data frame

##   island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
## 1      2           39.1          18.7               181        3750   1 2007
## 2      2           39.5          17.4               186        3800   0 2007
## 3      2           40.3          18.0               195        3250   0 2007
## 4      2             NA            NA                NA          NA  NA 2007
## 5      2           36.7          19.3               193        3450   0 2007
## 6      2           39.3          20.6               190        3650   1 2007

head(penguins$y) # target vector

## [1] 0 0 0 0 0 0
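
The output above shows that with return_X_y=TRUE the feature data frame can still contain rows with missing values (row 4), whereas the single data frame had such rows omitted. Below is a minimal sketch of dropping incomplete rows while keeping x and y aligned. It uses a small stand-in list shaped like the result above; the values and the helper name drop_incomplete are illustrative assumptions, not part of pmlbr.

```r
# Stand-in for the list returned by fetch_data('penguins', return_X_y=TRUE);
# the values are made up for illustration.
penguins_xy <- list(
  x = data.frame(
    bill_length_mm = c(39.1, 39.5, NA, 36.7),
    body_mass_g    = c(3750, 3800, NA, 3450)
  ),
  y = c(0, 0, 0, 1)
)

# Hypothetical helper: keep only rows whose features are all observed,
# and subset the target with the same row mask so x and y stay aligned.
drop_incomplete <- function(xy) {
  keep <- complete.cases(xy$x)
  list(x = xy$x[keep, , drop = FALSE], y = xy$y[keep])
}

clean <- drop_incomplete(penguins_xy)
nrow(clean$x)    # number of complete rows
length(clean$y)  # same as nrow(clean$x)
```

With pmlbr installed, the same helper could presumably be applied to the result of fetch_data('penguins', return_X_y=TRUE).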

Dataset characteristics

Let’s check other available datasets and their summary statistics:

# Dataset names
head(classification_dataset_names, 9)

## [1] "adult"                  "agaricus_lepiota"       "allbp"                 
## [4] "allhyper"               "allhypo"                "allrep"                
## [7] "analcatdata_aids"       "analcatdata_asbestos"   "analcatdata_authorship"

head(regression_dataset_names, 9)

## [1] "1027_ESL"             "1028_SWD"             "1029_LEV"            
## [4] "1030_ERA"             "1089_USCrime"         "1096_FacultySalaries"
## [7] "1191_BNG_pbc"         "1193_BNG_lowbwt"      "1196_BNG_pharynx"

# Dataset summaries
head(summary_stats)

##                dataset n_instances n_features n_binary_features
## 1             1027_ESL         488          4                 0
## 2             1028_SWD        1000         10                 0
## 3             1029_LEV        1000          4                 0
## 4             1030_ERA        1000          4                 0
## 5         1089_USCrime          47         13                 0
## 6 1096_FacultySalaries          50          4                 0
##   n_categorical_features n_continuous_features endpoint_type n_classes
## 1                      0                     4    continuous         9
## 2                      0                    10    continuous         4
## 3                      0                     4    continuous         5
## 4                      0                     4    continuous         9
## 5                      0                    13    continuous        42
## 6                      0                     4    continuous        39
##     imbalance       task
## 1 0.099363200 regression
## 2 0.108290667 regression
## 3 0.111245000 regression
## 4 0.031251250 regression
## 5 0.002970111 regression
## 6 0.004063158 regression

Selecting datasets whose characteristics satisfy certain conditions is straightforward with dplyr. For example, to find classification datasets with fewer than 100 observations:

library(dplyr)
summary_stats %>% 
  filter(n_instances < 100, task == 'classification') %>% 
  pull(dataset)

##  [1] "analcatdata_aids"           "analcatdata_asbestos"      
##  [3] "analcatdata_bankruptcy"     "analcatdata_cyyoung8092"   
##  [5] "analcatdata_cyyoung9302"    "analcatdata_fraud"         
##  [7] "analcatdata_happiness"      "analcatdata_japansolvent"  
##  [9] "confidence"                 "labor"                     
## [11] "lupus"                      "parity5"                   
## [13] "postoperative_patient_data"
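
For readers who prefer not to load dplyr, the same kind of selection can be sketched in base R. The small table below is a stand-in that reuses column names from summary_stats; its rows are invented for illustration and are not real PMLB values.

```r
# Toy stand-in for summary_stats (rows are illustrative, not real PMLB values)
toy_stats <- data.frame(
  dataset     = c("toy_small_clf", "toy_large_clf", "toy_small_reg"),
  n_instances = c(50, 5000, 47),
  task        = c("classification", "classification", "regression"),
  stringsAsFactors = FALSE
)

# Logical mask combining both conditions, then extract the dataset column
small_clf <- toy_stats$dataset[
  toy_stats$n_instances < 100 & toy_stats$task == "classification"
]
small_clf
```

Replacing toy_stats with summary_stats should give the same selection as the dplyr pipeline above.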

Find nearest datasets

You can also find the datasets most similar to your own dataset or to one of the PMLB datasets. For example:

nearest_datasets('penguins')

## [1] "penguins"      "ecoli"         "schizo"        "bupa"         
## [5] "solar_flare_1"
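
How nearest_datasets measures similarity is not described here. The sketch below only illustrates one plausible idea, ranking datasets by Euclidean distance over standardized summary characteristics. The table, the chosen columns, and the helper rank_by_similarity are assumptions made for illustration and should not be read as the package's implementation.

```r
# Invented characteristics for four datasets (not real PMLB values)
toy_stats <- data.frame(
  dataset     = c("a", "b", "c", "d"),
  n_instances = c(300, 320, 5000, 60),
  n_features  = c(7, 8, 40, 5),
  stringsAsFactors = FALSE
)

# Hypothetical helper: rank datasets by distance to a query dataset
rank_by_similarity <- function(stats, query, k = 3) {
  # Standardize each numeric column so no single scale dominates
  m <- scale(as.matrix(stats[, c("n_instances", "n_features")]))
  rownames(m) <- stats$dataset
  # Euclidean distance from the query row to every row
  d <- sqrt(rowSums(sweep(m, 2, m[query, ])^2))
  # Names of the k closest datasets, nearest first
  names(sort(d))[seq_len(k)]
}

rank_by_similarity(toy_stats, "a")
```

In this sketch the query dataset ranks first because its distance to itself is zero, which is consistent with "penguins" leading the output above.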

Further information

# Open the help pages for the main functions
?fetch_data
?nearest_datasets