The pmlbr package provides the following user-facing functions and variables in the R environment.
The package works with any recent version of R.
You can install the released version of pmlbr from CRAN with:
install.packages('pmlbr')
Or the development version from GitHub with remotes:
# install.packages('remotes') # uncomment to install remotes
library(remotes)
remotes::install_github("EpistasisLab/pmlbr")
The core function of this package is fetch_data, which allows us to download data from the PMLB repository. For example:
library(pmlbr)
# Download features and labels for penguins dataset in single data frame
penguins <- fetch_data('penguins')
str(penguins)
## 'data.frame': 333 obs. of 8 variables:
## $ island : int 2 2 2 2 2 2 2 2 2 2 ...
## $ bill_length_mm : num 39.1 39.5 40.3 36.7 39.3 38.9 39.2 41.1 38.6 34.6 ...
## $ bill_depth_mm : num 18.7 17.4 18 19.3 20.6 17.8 19.6 17.6 21.2 21.1 ...
## $ flipper_length_mm: int 181 186 195 193 190 181 195 182 191 198 ...
## $ body_mass_g : int 3750 3800 3250 3450 3650 3625 4675 3200 3800 4400 ...
## $ sex : int 1 0 0 0 1 0 1 0 1 1 ...
## $ year : int 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
## $ target : int 0 0 0 0 0 0 0 0 0 0 ...
## - attr(*, "na.action")= 'omit' Named int [1:11] 4 9 10 11 12 48 179 219 257 269 ...
## ..- attr(*, "names")= chr [1:11] "4" "9" "10" "11" ...
# Download features and labels for penguins dataset in separate data structures
penguins <- fetch_data('penguins', return_X_y=TRUE)
head(penguins$x) # feature data frame
## island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
## 1 2 39.1 18.7 181 3750 1 2007
## 2 2 39.5 17.4 186 3800 0 2007
## 3 2 40.3 18.0 195 3250 0 2007
## 4 2 NA NA NA NA NA 2007
## 5 2 36.7 19.3 193 3450 0 2007
## 6 2 39.3 20.6 190 3650 1 2007
head(penguins$y) # target vector
## [1] 0 0 0 0 0 0
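In the output above, the single data frame has 333 rows and carries an na.action attribute, while head(penguins$x) still shows a row of NA values. If a downstream model needs complete cases, one option is to filter the features and the target together so they stay aligned. The sketch below uses a small hand-built list, penguins_xy, as a hypothetical stand-in for the fetch_data result (values copied from the first rows shown above) so that it runs without network access; it is not part of pmlbr.

```r
# Hypothetical stand-in for fetch_data('penguins', return_X_y=TRUE);
# only a few rows and columns, copied from the output shown above.
penguins_xy <- list(
  x = data.frame(
    island         = c(2L, 2L, 2L, 2L),
    bill_length_mm = c(39.1, 39.5, 40.3, NA),
    bill_depth_mm  = c(18.7, 17.4, 18.0, NA)
  ),
  y = c(0L, 0L, 0L, 0L)
)

# TRUE for rows in which every feature is observed
keep <- complete.cases(penguins_xy$x)

# Apply the same row mask to features and target so they stay aligned
x_complete <- penguins_xy$x[keep, , drop = FALSE]
y_complete <- penguins_xy$y[keep]
```

With the real download, the same two subsetting lines would apply to penguins$x and penguins$y.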
Let’s check other available datasets and their summary statistics:
# Dataset names
head(classification_dataset_names, 9)
## [1] "adult" "agaricus_lepiota" "allbp"
## [4] "allhyper" "allhypo" "allrep"
## [7] "analcatdata_aids" "analcatdata_asbestos" "analcatdata_authorship"
head(regression_dataset_names, 9)
## [1] "1027_ESL" "1028_SWD" "1029_LEV"
## [4] "1030_ERA" "1089_USCrime" "1096_FacultySalaries"
## [7] "1191_BNG_pbc" "1193_BNG_lowbwt" "1196_BNG_pharynx"
# Dataset summaries
head(summary_stats)
## dataset n_instances n_features n_binary_features
## 1 1027_ESL 488 4 0
## 2 1028_SWD 1000 10 0
## 3 1029_LEV 1000 4 0
## 4 1030_ERA 1000 4 0
## 5 1089_USCrime 47 13 0
## 6 1096_FacultySalaries 50 4 0
## n_categorical_features n_continuous_features endpoint_type n_classes
## 1 0 4 continuous 9
## 2 0 10 continuous 4
## 3 0 4 continuous 5
## 4 0 4 continuous 9
## 5 0 13 continuous 42
## 6 0 4 continuous 39
## imbalance task
## 1 0.099363200 regression
## 2 0.108290667 regression
## 3 0.111245000 regression
## 4 0.031251250 regression
## 5 0.002970111 regression
## 6 0.004063158 regression
Selecting a subset of datasets whose characteristics satisfy certain conditions is straightforward with dplyr. For example, if we need datasets with fewer than 100 observations for a classification task:
library(dplyr)
summary_stats %>%
  filter(n_instances < 100, task == 'classification') %>%
  pull(dataset)
## [1] "analcatdata_aids" "analcatdata_asbestos"
## [3] "analcatdata_bankruptcy" "analcatdata_cyyoung8092"
## [5] "analcatdata_cyyoung9302" "analcatdata_fraud"
## [7] "analcatdata_happiness" "analcatdata_japansolvent"
## [9] "confidence" "labor"
## [11] "lupus" "parity5"
## [13] "postoperative_patient_data"
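The same selection can be written in base R when dplyr is not available. This is a minimal sketch on a hypothetical table, toy_stats, that mimics three columns of summary_stats: the two regression rows reuse values shown above, and the two classification rows (small_clf, large_clf) are made up for illustration.

```r
# Hypothetical miniature of summary_stats (not real PMLB metadata for the *_clf rows)
toy_stats <- data.frame(
  dataset     = c("1027_ESL", "1089_USCrime", "small_clf", "large_clf"),
  n_instances = c(488, 47, 80, 5000),
  task        = c("regression", "regression", "classification", "classification"),
  stringsAsFactors = FALSE
)

# Logical mask equivalent to filter(n_instances < 100, task == 'classification')
is_small_clf <- toy_stats$n_instances < 100 & toy_stats$task == "classification"

# Equivalent to pull(dataset): extract the matching names as a character vector
small_datasets <- toy_stats$dataset[is_small_clf]
small_datasets
```

Replacing toy_stats with summary_stats should give the same names as the dplyr pipeline, assuming the column names match those printed above.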
You can also find the datasets that are most similar to your own dataset or to one of the PMLB datasets. For example:
nearest_datasets('penguins')
## [1] "penguins" "ecoli" "schizo" "bupa"
## [5] "solar_flare_1"
# Open the help pages for the main functions
?fetch_data
?nearest_datasets