pmlbr provides the following user-facing functions and variables in the R environment. For installation instructions or a high-level overview of how to use PMLB, please see R interface.
fetch_data
function
fetch_data(dataset_name, return_X_y = FALSE, local_cache_dir = NA, dropna = TRUE)
Download a dataset from PMLB, (optionally) store it locally, and return the dataset.
You must be connected to the internet if you are fetching a dataset that is not cached locally.
Parameters:
return_X_y : logical If FALSE, the data are returned as a dataframe of size (n_samples, n_features+1), where the last column is target, the true target value of each data point. If TRUE, the data are returned as a list of length 2, where the first element is a dataframe of size (n_samples, n_features) corresponding to the feature matrix (usually denoted X), and the second element is a numeric vector of length n_samples corresponding to the target values (usually denoted y).
dropna : logical If TRUE, NAs will be automatically dropped from the exported dataset.
Returns:
If return_X_y == FALSE, the fetched dataset as a dataframe of size (n_samples, n_features+1), where the last column is target. If return_X_y == TRUE, a list of length 2 containing the feature matrix X and the target array y, respectively.
Example:
library(pmlbr)
# Download features and labels for penguins dataset in single data frame
penguins <- fetch_data('penguins')
str(penguins)
## 'data.frame': 333 obs. of 8 variables:
## $ island : int 2 2 2 2 2 2 2 2 2 2 ...
## $ bill_length_mm : num 39.1 39.5 40.3 36.7 39.3 38.9 39.2 41.1 38.6 34.6 ...
## $ bill_depth_mm : num 18.7 17.4 18 19.3 20.6 17.8 19.6 17.6 21.2 21.1 ...
## $ flipper_length_mm: int 181 186 195 193 190 181 195 182 191 198 ...
## $ body_mass_g : int 3750 3800 3250 3450 3650 3625 4675 3200 3800 4400 ...
## $ sex : int 1 0 0 0 1 0 1 0 1 1 ...
## $ year : int 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
## $ target : int 0 0 0 0 0 0 0 0 0 0 ...
## - attr(*, "na.action")= 'omit' Named int [1:11] 4 9 10 11 12 48 179 219 257 269 ...
## ..- attr(*, "names")= chr [1:11] "4" "9" "10" "11" ...
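To make the two return shapes and the NA handling concrete, here is a minimal base-R sketch that runs without pmlbr or a network connection. The data frame `toy` holds made-up values, and `split_X_y` is a hypothetical helper, not part of pmlbr. The `na.action` attribute in the `str()` output above is what base R's `na.omit()` leaves behind, so the sketch assumes similar row-wise dropping; the package's internals may differ.

```r
# Toy stand-in for a fetched dataset (made-up values, not PMLB data).
toy <- data.frame(
  bill_length_mm = c(39.1, NA, 40.3, 36.7),
  body_mass_g    = c(3750L, 3800L, NA, 3450L),
  target         = c(0L, 0L, 1L, 1L)
)

# na.omit() removes rows containing any NA and records the indices of the
# removed rows in an "na.action" attribute.
complete <- na.omit(toy)

# Hypothetical helper: split a data frame into a feature data frame and a
# target vector, mirroring the list-of-length-2 shape described above.
split_X_y <- function(df, target_name = "target") {
  list(
    X = df[, setdiff(names(df), target_name), drop = FALSE],
    y = df[[target_name]]
  )
}

parts <- split_X_y(complete)
dim(parts$X)     # (n_samples, n_features)
length(parts$y)  # n_samples
```

The element names `X` and `y` are chosen here for readability; check the names of the list returned by `fetch_data()` before relying on them.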
nearest_datasets
function
nearest_datasets(
  x,
  y = NULL,
  n_neighbors = 5,
  dimensions = c("n_instances", "n_features"),
  task = c("classification", "regression"),
  target_name = "target",
  ...
)
Select the nearest datasets given a dataset name or dataframe x.
Parameters:
x : str or dataframe Character string of a dataset name from PMLB, or a data.frame of n_samples x n_features (or n_features+1 with a target column).
y : numeric vector Target column. Required when x does not contain the target column.
n_neighbors : integer Number of dataset names to return as neighbors.
dimensions : character vector Dataset characteristics to include in the similarity calculation. Dimensions must correspond to numeric columns of all_summary_stats.tsv. The default in the signature above is c("n_instances", "n_features"); if 'all', uses all numeric columns.
target_name : str Character string specifying column of target/dependent variable.
task : str Character string specifying classification or regression for summary stat generation.
... : Further arguments passed to each S3 method.
Returns:
dataset : character vector Names of the most similar datasets to x, most similar dataset first.
Examples:
nearest_datasets('penguins')
## [1] "penguins" "ecoli" "schizo" "bupa"
## [5] "solar_flare_1"
nearest_datasets(fetch_data('penguins'))
## [1] "penguins" "ecoli" "schizo" "bupa"
## [5] "solar_flare_1"
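As a rough illustration of ranking datasets by similarity over the chosen dimensions, the sketch below orders a toy summary table by plain Euclidean distance to a query. This is an assumption made for illustration and not necessarily the metric or scaling pmlbr uses. `rank_nearest` and `query` are hypothetical; the three rows reuse the n_instances and n_features values printed for summary_stats in this reference.

```r
# Toy summary table (three rows, two numeric dimensions).
stats <- data.frame(
  dataset     = c("1027_ESL", "1028_SWD", "1089_USCrime"),
  n_instances = c(488, 1000, 47),
  n_features  = c(4, 10, 13),
  stringsAsFactors = FALSE
)

# Made-up characteristics of a user-supplied dataset.
query <- c(n_instances = 60, n_features = 12)

# Hypothetical helper: return the names of the n_neighbors datasets whose
# selected dimensions are closest to the query (unscaled Euclidean distance).
rank_nearest <- function(stats, query, dimensions, n_neighbors = 2) {
  m <- as.matrix(stats[, dimensions, drop = FALSE])
  # Subtract the query from every row, then take row-wise Euclidean norms.
  d <- sqrt(rowSums(sweep(m, 2, query[dimensions])^2))
  stats$dataset[order(d)][seq_len(min(n_neighbors, nrow(stats)))]
}

nearest <- rank_nearest(stats, query, c("n_instances", "n_features"))
nearest
```

Because the columns are not rescaled here, n_instances dominates the distance; a real implementation would likely normalise the dimensions first.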
These objects contain details on datasets currently available in PMLB.
dataset_names
A list of all datasets included in PMLB.
The contents of this vector are the union of classification_dataset_names and regression_dataset_names.
Example:
head(dataset_names, 9)
## [1] "1027_ESL" "1028_SWD" "1029_LEV"
## [4] "1030_ERA" "1089_USCrime" "1096_FacultySalaries"
## [7] "1191_BNG_pbc" "1193_BNG_lowbwt" "1196_BNG_pharynx"
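The union relationship can be illustrated with base R set functions. The vectors below are small hand-typed stand-ins built from names that appear in the examples of this reference, not the full pmlbr objects; the variable names are hypothetical.

```r
# Toy stand-ins for the task-specific name vectors.
classification_toy <- c("adult", "allbp", "allhyper")
regression_toy     <- c("1027_ESL", "1028_SWD")

# union() concatenates the two vectors and drops duplicates.
all_toy <- union(classification_toy, regression_toy)

# Every name from either task-specific vector appears in the combined one.
all(classification_toy %in% all_toy)
all(regression_toy %in% all_toy)
length(all_toy)
```

With pmlbr loaded, the same `%in%` checks could be run against the real objects.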
classification_dataset_names
A list of all classification datasets included in PMLB.
Classification datasets are datasets where the target values (dependent variable/outcome) are discrete.
Example:
head(classification_dataset_names, 9)
## [1] "adult" "agaricus_lepiota" "allbp"
## [4] "allhyper" "allhypo" "allrep"
## [7] "analcatdata_aids" "analcatdata_asbestos" "analcatdata_authorship"
regression_dataset_names
A list of all regression datasets included in PMLB.
Regression datasets are datasets where the target values (dependent variable/outcome) are continuous.
Example:
head(regression_dataset_names, 9)
## [1] "1027_ESL" "1028_SWD" "1029_LEV"
## [4] "1030_ERA" "1089_USCrime" "1096_FacultySalaries"
## [7] "1191_BNG_pbc" "1193_BNG_lowbwt" "1196_BNG_pharynx"
summary_stats
Summary statistics for all datasets.
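A data frame with 10 variables: dataset, n_instances, n_features, n_binary_features, n_categorical_features, n_continuous_features, endpoint_type, n_classes, imbalance, and task.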
Example:
head(summary_stats)
## dataset n_instances n_features n_binary_features
## 1 1027_ESL 488 4 0
## 2 1028_SWD 1000 10 0
## 3 1029_LEV 1000 4 0
## 4 1030_ERA 1000 4 0
## 5 1089_USCrime 47 13 0
## 6 1096_FacultySalaries 50 4 0
## n_categorical_features n_continuous_features endpoint_type n_classes
## 1 0 4 continuous 9
## 2 0 10 continuous 4
## 3 0 4 continuous 5
## 4 0 4 continuous 9
## 5 0 13 continuous 42
## 6 0 4 continuous 39
## imbalance task
## 1 0.099363200 regression
## 2 0.108290667 regression
## 3 0.111245000 regression
## 4 0.031251250 regression
## 5 0.002970111 regression
## 6 0.004063158 regression
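A common use of this table is picking datasets by size or task before fetching them. The sketch below retypes the six rows printed above, with a subset of the columns, so that it runs without pmlbr installed. `stats_head` and `small` are hypothetical names; with the package loaded, the same `subset()` call could be applied to summary_stats directly.

```r
# The six rows shown by head(summary_stats), retyped by hand.
stats_head <- data.frame(
  dataset     = c("1027_ESL", "1028_SWD", "1029_LEV",
                  "1030_ERA", "1089_USCrime", "1096_FacultySalaries"),
  n_instances = c(488L, 1000L, 1000L, 1000L, 47L, 50L),
  n_features  = c(4L, 10L, 4L, 4L, 13L, 4L),
  task        = rep("regression", 6),
  stringsAsFactors = FALSE
)

# Keep regression datasets with fewer than 100 instances.
small <- subset(stats_head, task == "regression" & n_instances < 100)
small$dataset
```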