Introduction

Thank you for contributing to PMLB!

We want this to be the easiest resource to use for benchmarking machine learning algorithms on many datasets. This is a community effort, and we rely on help from users like you.

Why you should read this

Making a really easy-to-use benchmark resource also means being diligent about how contributions are made. Following these guidelines helps to communicate that you respect the time of the developers managing and developing this open source project. In return, we will reciprocate that respect in addressing your issue, assessing changes and helping you finalize your pull requests.

Ground rules

Please be kind. We will, too.

Responsibilities

The main contribution our project needs at the moment is identifying the sources of datasets and annotating the datasets that currently lack metadata. Please see the Existing dataset annotation tab for more detail. We also welcome dataset additions that meet the format specifications of PMLB, and we are open to other ideas such as improving documentation or writing tutorials.

  • For source identification and annotation of existing datasets, make sure your pull request follows our source guidelines in the Existing dataset annotation tab.
  • For new datasets, please make sure your pull request follows our new dataset guidelines under the Add new dataset tab.
  • Create issues for any major changes and enhancements that you wish to make. Discuss things transparently and get community feedback.
  • Be welcoming to newcomers and encourage new contributors from all backgrounds. See the Python Community Code of Conduct as an example.

Getting started

Your first contribution

If you haven’t contributed to open source code before, check out these friendly tutorials:

Those guides should tell you everything you need to start out!

How to submit a contribution

  1. Create your own fork of this repository
  2. Make the changes in your fork on a branch that is NOT master or main.
  3. If you think the project would benefit from these changes:
    • Make sure you have followed the guidelines above.
    • Submit a pull request.

How to report a bug

When filing an issue, please make sure to answer these five questions:

  1. What version of PMLB are you using?
  2. What operating system and processor architecture are you using?
  3. What did you do?
  4. What did you expect to see?
  5. What did you see instead?
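
If you are unsure about the first two questions, a short snippet like the following can gather that information (a minimal sketch; pmlb.__version__ may not be defined in every release, so the code falls back to 'unknown', in which case pip show pmlb will report the version):

import platform

import pmlb

# Installed PMLB version (falls back to 'unknown' if the attribute is missing).
print(getattr(pmlb, '__version__', 'unknown'))

# Operating system and processor architecture.
print(platform.platform(), platform.machine())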

Add new dataset

New datasets should follow these guidelines:

  • Each sample/observation forms a row of the dataset.
  • Each feature/variable forms a column of the dataset.
  • The dependent variable/outcome/target should be labelled 'target'.
  • If the task is classification, the dependent variable must be encoded with numeric, contiguous labels in [0, 1, ..., k-1], where k is the number of classes in the data.
  • Column headers are feature/variable names and 'target'.
  • Any 'sample_id' or 'row_id' column should be excluded.
  • The data should be tab-delimited and in .tsv.gz format.
  • The .tsv.gz dataset file should be in the correct folder; i.e., under pmlb/datasets/your_dataset/
  • A metadata.yaml file will be automatically generated in the same folder, pushed back to your branch, and a reviewer will ask you to manually review this file (e.g., add description, link to source, etc.) in your pull request.
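
As a rough illustration of the guidelines above, a new dataset could be prepared with pandas along the following lines (a sketch only; the input file name and the 'class' and 'row_id' columns are hypothetical placeholders for your own data):

import pandas as pd

df = pd.read_csv('your_dataset.csv')

# Rename the dependent variable to 'target' and drop any row-identifier column.
df = df.rename(columns={'class': 'target'})
df = df.drop(columns=['row_id'], errors='ignore')

# For classification, re-encode the labels as contiguous integers 0..k-1.
df['target'] = pd.factorize(df['target'])[0]

# Write a tab-delimited, gzip-compressed file into the dataset's folder.
df.to_csv('pmlb/datasets/your_dataset/your_dataset.tsv.gz',
          sep='\t', index=False, compression='gzip')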

Please let us know if you have any questions along the way. We appreciate your contribution, and we want to help make your workflow as simple as possible!

Relevant issues: #13, #22.

Existing dataset annotation

  1. Verify the source for the dataset.

    • Often the place to start is an internet search of the dataset name. Most datasets can be found in OpenML, the UC Irvine ML repository, or Kaggle.
    • Follow the How to verify source subsection below to verify that the PMLB dataset actually came from the source you found.
  2. Update the information on the dataset’s metadata.yaml file. Refer to the metadata template file or wine_quality_red as an example.

  3. Issue a pull request for your changes.

How to verify source

There are a few ways we can check whether a PMLB dataframe (df_pmlb) agrees with its source (df_source), provided that we have first compared their shapes (by printing df_pmlb.shape and df_source.shape) and renamed the dependent variable's column in the source to target. For example, if the dependent variable in the source dataset is class, you can use df_source = df_source.rename(columns={'class': 'target'}).

  • If the two dataframes are exactly the same, the following line of code runs without raising an error ✔️:
pd.testing.assert_frame_equal(df_source, df_pmlb)
  • If it raises an error, the column names may be different. If we have good reasons to ignore column names, we can check whether the values contained in the two dataframes are the same with
(df_source.values == df_pmlb.values).all()

If we still get False, the rows may have been permuted. To check whether the two dataframes contain the same rows regardless of order:

set(df_pmlb.itertuples(index=False)) == set(df_source.itertuples(index=False))

or try “subtracting” the two datasets row by row (these two lines print the rows that appear in one dataframe but not in the other, which can help you see the difference a bit better):

df_source.merge(df_pmlb, indicator=True, how='left').loc[lambda x: x['_merge']!='both']
df_pmlb.merge(df_source, indicator=True, how='left').loc[lambda x: x['_merge']!='both']

If the two dataframes have floats that are almost equal to each other, we can use numpy’s isclose to check if they are element-wise equal within a tolerance:

from numpy import isclose

isclose(df_source.values, df_pmlb.values).all()

We have been using Google Colab notebooks to share our checks, but other methods are also welcome. If you do use Google Colab, please add your notebook to our shared folder http://tiny.cc/pmlb-colabs and/or share a publicly accessible link to it in the pull request when you submit one. For example notebooks, see the shared Drive folder.

Working with GitHub Actions

Please submit your contribution in a PR from a branch of your fork (NOT the master branch) and, if possible, with only one commit. If that is not the case, we can help you fix it.

  • If you contributed metadata:
    • Let's assume that the inferred “task” is incorrect and you have fixed it. get_updated_metadatas() picks up the edited dataset by checking the diff of the latest commit. A GitHub Action (GA) regenerates summary_stats.tsv, updates classification_dataset_names and regression_dataset_names, and adds the dataset name to dataset_with_metadata (which contains the names of datasets with customized metadata files). GA then pushes these changes back to your branch as long as it is not master.
    • If your PR resulted in a failed build, your metadata.yaml may be invalid. Please use http://yamllint.com to verify it. A common problem that invalidates a .yaml file is an unquoted colon (:) inside a value, for example in a publication title; quoting the whole string (e.g. "Some title: a subtitle") fixes it.
  • If you contributed a new dataset:
    • get_updated_datasets() picks up the new dataset by checking the diff of the latest commit (a rough sketch of this diff check appears after this list). GA autogenerates metadata.yaml, readme.md, summary_stats.tsv, all_summary_stats.tsv, and the pandas profiling report, and updates dataset_names (as well as classification_dataset_names and regression_dataset_names).
    • A reviewer will ask you to manually update the metadata.yaml.
    • GA adds the dataset name to dataset_with_metadata, which contains the names of datasets with customized metadata files, and pushes these changes back to your branch as long as it is not master or main.
  • If you wish to regenerate all the pandas profiling reports:
    • Please include [regenerate_profiles] in the commit message.
  • If you wish to regenerate/update the supporting files for all datasets:
    • Please include [update_all_datasets] in the commit message.
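
For the curious, the diff-based detection described above works roughly like the sketch below. This is not the project's actual implementation, and the helper name updated_dataset_dirs is made up here; it only illustrates the idea of finding the dataset folders touched by the latest commit:

import subprocess

def updated_dataset_dirs():
    # List the files changed in the latest commit.
    changed = subprocess.run(
        ['git', 'diff', '--name-only', 'HEAD~1', 'HEAD'],
        capture_output=True, text=True, check=True,
    ).stdout.splitlines()
    # Keep only paths under pmlb/datasets/<name>/ and deduplicate the names.
    names = {path.split('/')[2] for path in changed
             if path.startswith('pmlb/datasets/') and len(path.split('/')) > 3}
    return sorted(names)

print(updated_dataset_dirs())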