Thank you for contributing to PMLB!
We want this to be the easiest resource to use for benchmarking machine learning algorithms on many datasets. This is a community effort, and we rely on help from users like you.
Making a really easy-to-use benchmark resource also means being diligent about how contributions are made. Following these guidelines helps to communicate that you respect the time of the developers managing and developing this open source project. In return, we will reciprocate that respect in addressing your issue, assessing changes and helping you finalize your pull requests.
Please be kind. We will, too.
The main contribution our project needs at the moment is to identify the source of datasets and annotate the datasets that currently don’t have metadata information. Please see the Existing dataset annotation tab for more detail. We would also consider dataset additions that meet the format specifications of PMLB. We’re also open to other ideas, including improving documentation, writing tutorials, etc.
If you haven’t contributed to open source code before, check out one of the many friendly tutorials online; those guides should tell you everything you need to start out!
When filing an issue, please make sure to answer these five questions:

1. What version of PMLB are you using?
2. What operating system and processor architecture are you using?
3. What did you do?
4. What did you expect to see?
5. What did you see instead?
New datasets should follow these guidelines:
- The dependent variable column should be named 'target'.
- Class labels should be encoded as integers [0, 1, .. k], where there are k classes in the data.
- Only the features and the 'target' column should be included; any 'sample_id' or 'row_id' column should be excluded.
- The dataset should be compressed in the `.tsv.gz` format (see the preparation sketch below).
- The compressed `.tsv.gz` dataset file should be in the correct folder; i.e., under `pmlb/datasets/your_dataset/`.
- After you open the pull request, a `metadata.yaml` file will be automatically generated in the same folder, pushed back to your branch, and a reviewer will ask you to manually review this file (e.g., add description, link to source, etc.) in your pull request.

Please let us know if you have any questions along the way. We appreciate your contribution, and we want to help make your workflow as simple as possible!
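To illustrate these guidelines, here is a minimal preparation sketch; the dataset name, source file, and original column names are hypothetical:

```python
import pandas as pd

# Load the raw source data (hypothetical file name)
df = pd.read_csv('my_dataset_raw.csv')

# The dependent variable must be named 'target' (here it was 'class')
df = df.rename(columns={'class': 'target'})

# Index-like columns such as 'sample_id' or 'row_id' must be excluded
df = df.drop(columns=['sample_id', 'row_id'], errors='ignore')

# Encode class labels as integers 0, 1, .., k
df['target'] = pd.factorize(df['target'])[0]

# Write a gzip-compressed TSV into the expected folder
df.to_csv('pmlb/datasets/my_dataset/my_dataset.tsv.gz',
          sep='\t', index=False, compression='gzip')
```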
To annotate an existing dataset:

1. Verify the source for the dataset.
2. Update the information in the dataset’s `metadata.yaml` file. Refer to the metadata template file or `wine_quality_red` as an example.
3. Issue a pull request for your changes.
There are a few ways we can check whether a PMLB dataframe (`df_pmlb`) agrees with its source (`df_source`), provided that we have checked their shapes (by printing `df_pmlb.shape` and `df_source.shape`) and renamed the dependent variable’s column to `target`. For example, if the dependent variable in the source dataset is `class`, you can use `df_source = df_source.rename(columns={'class': 'target'})`.
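To set up such a comparison, here is a minimal sketch; the dataset name and source file path are hypothetical, and `fetch_data` is PMLB’s dataset loader:

```python
import pandas as pd
from pmlb import fetch_data

# Load the PMLB copy of the dataset (name is hypothetical)
df_pmlb = fetch_data('my_dataset')

# Load the original source and rename its dependent variable to 'target'
df_source = pd.read_csv('my_dataset_source.csv')
df_source = df_source.rename(columns={'class': 'target'})

# The shapes should agree before any element-wise comparison
print(df_pmlb.shape, df_source.shape)
```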
If the shapes match, the strictest checks are:

```python
pd.testing.assert_frame_equal(df_source, df_pmlb)
(df_source.values == df_pmlb.values).all()
```
If we still get `False`, it is possible that the rows have been permuted. To check if they are:
```python
# Order-insensitive comparison; note that set() collapses duplicate rows
set(df_pmlb.itertuples(index=False)) == set(df_source.itertuples(index=False))
```
or by “subtracting” the two datasets row by row; these two lines will print the rows that are in one dataframe but not the other, which can help you see the difference a bit better:
```python
# Rows in df_source with no exact match in df_pmlb
df_source.merge(df_pmlb, indicator=True, how='left').loc[lambda x: x['_merge'] != 'both']
# Rows in df_pmlb with no exact match in df_source
df_pmlb.merge(df_source, indicator=True, how='left').loc[lambda x: x['_merge'] != 'both']
```
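Another order-insensitive option, which (unlike the `set()` check above) also respects duplicated rows, is to sort both dataframes before comparing; a small sketch:

```python
import pandas as pd

# Sort by all columns and reset the index so that row order no longer matters
df_a = df_source.sort_values(list(df_source.columns)).reset_index(drop=True)
df_b = df_pmlb.sort_values(list(df_pmlb.columns)).reset_index(drop=True)

# Raises an informative AssertionError if the frames still differ
pd.testing.assert_frame_equal(df_a, df_b)
```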
If the two dataframes have floats that are almost equal to each other, we can use `numpy`’s `isclose` to check whether they are element-wise equal within a tolerance:
```python
from numpy import isclose

# Element-wise comparison within a tolerance; pass rtol/atol to isclose
# to adjust the tolerance if needed
isclose(df_source.values, df_pmlb.values).all()
```
We have been using Google Colab notebooks to share our checks, but other methods are also welcome. If you do use Google Colab, please add your notebook to our shared folder (http://tiny.cc/pmlb-colabs) and/or share a publicly accessible link to it in your pull request. You can browse the shared Drive folder for example notebooks.
Please submit your contribution in a PR from a branch of your fork (NOT the master branch) and, if possible, with only one commit. If this is not the case, we can help you fix it.
After you open your PR, automated checks run. If your PR annotates an existing dataset:

- `get_updated_metadatas()` picks up the dataset by checking the diff of the latest commit. A GitHub Action (GA) regenerates `summary_stats.tsv`, updates `classification_dataset_names` and `regression_dataset_names`, and adds the dataset name to `dataset_with_metadata` (which contains the names of datasets with customized metadata files). GA then pushes these changes back to your branch as long as it’s not master.
- If the GA fails, your `metadata.yaml` may be invalid. Please use http://yamllint.com to verify it. A common problem that invalidates a `.yaml` file is the use of a colon `:` without quotes (in a publication title, for example).

If your PR adds a new dataset:

- `get_updated_datasets()` picks up the new dataset by checking the diff of the latest commit. GA autogenerates `metadata.yaml`, `readme.md`, `summary_stats.tsv`, `all_summary_stats.tsv`, and the pandas profiling report, and updates `dataset_names` (as well as `classification_dataset_names` and `regression_dataset_names`).
- A reviewer will ask you to manually update the autogenerated `metadata.yaml`.
- GA then adds the dataset name to `dataset_with_metadata`, which contains the names of datasets with customized metadata files, and pushes these changes back to your branch as long as it’s not the master or main branch.
- To regenerate the pandas profiling reports, include `[regenerate_profiles]` in the commit message.
- To update all datasets, include `[update_all_datasets]` in the commit message.