TPOT API
Classification
class tpot.TPOTClassifier(generations=100, population_size=100, offspring_size=None, mutation_rate=0.9, crossover_rate=0.1, scoring='accuracy', cv=5, subsample=1.0, n_jobs=1, max_time_mins=None, max_eval_time_mins=5, random_state=None, config_dict=None, template=None, warm_start=False, memory=None, use_dask=False, periodic_checkpoint_folder=None, early_stop=None, verbosity=0, disable_update_check=False, log_file=None)
Automated machine learning for supervised classification tasks.
The TPOTClassifier performs an intelligent search over machine learning pipelines that can contain supervised classification models, preprocessors, feature selection techniques, and any other estimator or transformer that follows the scikit-learn API. The TPOTClassifier will also search over the hyperparameters of all objects in the pipeline.
By default, TPOTClassifier will search over a broad range of supervised classification algorithms, transformers, and their hyperparameters. However, the algorithms, transformers, and hyperparameters that TPOTClassifier searches over can be fully customized using the config_dict parameter.
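A custom config_dict maps each operator's import path to the hyperparameter values TPOT may search over. The sketch below is illustrative only; the two operators and value grids shown are arbitrary choices for the example, not TPOT's defaults.

from tpot import TPOTClassifier

# Keys are operator import paths; values map each hyperparameter to its
# candidate values. An empty dictionary means "use the operator as-is".
tpot_config = {
    'sklearn.naive_bayes.GaussianNB': {
    },
    'sklearn.tree.DecisionTreeClassifier': {
        'criterion': ['gini', 'entropy'],
        'max_depth': range(1, 11),
    },
}

tpot = TPOTClassifier(config_dict=tpot_config)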
Read more in the User Guide.
Parameters:
generations: int or None, optional (default=100)
Number of iterations to run the pipeline optimization process. It must be a positive number or None. If None, the parameter max_time_mins must be defined as the runtime limit.
population_size: int, optional (default=100)
Number of individuals to retain in the genetic programming population every generation. Must be a positive number.
offspring_size: int, optional (default=None)
Number of offspring to produce in each genetic programming generation. Must be a positive number. By default, the number of offspring is equal to the population size.
mutation_rate: float, optional (default=0.9)
Mutation rate for the genetic programming algorithm in the range [0.0, 1.0]. This parameter tells the GP algorithm how many pipelines to apply random changes to every generation.
crossover_rate: float, optional (default=0.1)
Crossover rate for the genetic programming algorithm in the range [0.0, 1.0]. This parameter tells the genetic programming algorithm how many pipelines to "breed" every generation.
scoring: string or callable, optional (default='accuracy')
Function used to evaluate the quality of a given pipeline for the classification problem. Built-in scoring functions (following scikit-learn's scoring-string conventions) or a custom callable can be used.
cv: int, cross-validation generator, or an iterable, optional (default=5)
Cross-validation strategy used when evaluating pipelines.
subsample: float, optional (default=1.0)
Fraction of training samples that are used during the TPOT optimization process. Must be in the range (0.0, 1.0].
n_jobs: integer, optional (default=1)
Number of processes to use in parallel for evaluating pipelines during the TPOT optimization process.
max_time_mins: integer or None, optional (default=None)
How many minutes TPOT has to optimize the pipeline.
max_eval_time_mins: float, optional (default=5)
How many minutes TPOT has to evaluate a single pipeline.
random_state: integer or None, optional (default=None)
The seed of the pseudo-random number generator used in TPOT.
config_dict: Python dictionary, string, or None, optional (default=None)
A configuration dictionary for customizing the operators and parameters that TPOT searches over in the optimization process.
template: string, optional (default=None)
Template of a predefined pipeline structure, used to specify a desired structure for the machine learning pipelines evaluated by TPOT.
warm_start: boolean, optional (default=False)
Flag indicating whether the TPOT instance will reuse the population from previous calls to fit().
memory: a joblib.Memory object or string, optional (default=None)
If supplied, the pipeline will cache each transformer after calling fit. This feature is used to avoid recomputing fitted transformers within a pipeline when the parameters and input data are identical to those of another pipeline fitted during the optimization process. See the scikit-learn documentation for more details on memory caching.
use_dask: boolean, optional (default=False)
Whether to use Dask-ML's pipeline optimizations. This avoids re-fitting the same estimator on the same split of data multiple times. It will also provide more detailed diagnostics when using Dask's distributed scheduler.
periodic_checkpoint_folder: path string, optional (default=None)
If supplied, a folder in which TPOT will periodically save the Pareto-front pipelines found so far while optimizing.
early_stop: integer, optional (default=None)
Number of generations without improvement after which TPOT stops the optimization process.
verbosity: integer, optional (default=0)
How much information TPOT communicates while it is running.
disable_update_check: boolean, optional (default=False)
Flag indicating whether the TPOT version checker should be disabled.
log_file: file-like object (io.TextIOWrapper or io.StringIO) or string, optional (default=None)
Save progress content to a file. If it is a string giving the path and file name of the desired output file, TPOT will create the file and write the log into it. If None, TPOT will write the log to sys.stdout.
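As a rough sketch of how several of these options combine (the values below are illustrative, not recommendations):

tpot = TPOTClassifier(
    generations=None,            # run until the time budget is exhausted...
    max_time_mins=60,            # ...which is one hour here
    scoring='f1_macro',          # scikit-learn-style scoring string or a callable
    early_stop=10,               # stop after 10 generations without improvement
    periodic_checkpoint_folder='tpot_checkpoints',  # save Pareto-front pipelines as it runs
    random_state=42,
    verbosity=2,
)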
Attributes:
fitted_pipeline_: scikit-learn Pipeline object
The best pipeline that TPOT discovered during the pipeline optimization process, fitted on the entire training dataset.
pareto_front_fitted_pipelines_: Python dictionary
Dictionary containing all the pipelines on the TPOT Pareto front, where the key is the string representation of the pipeline and the value is the corresponding pipeline fitted on the entire training dataset.
evaluated_individuals_: Python dictionary
Dictionary containing all pipelines that were evaluated during the pipeline optimization process, where the key is the string representation of the pipeline and the value is a tuple containing (# of steps in the pipeline, accuracy metric for the pipeline).
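After fit() completes, these attributes can be inspected directly. A brief sketch, assuming a fitted instance named tpot and the tuple layout described above:

print(tpot.fitted_pipeline_)    # the single best scikit-learn Pipeline
# Each evaluated pipeline maps to (# of steps, its cross-validated score).
for pipeline_str, (n_steps, cv_score) in tpot.evaluated_individuals_.items():
    print(n_steps, cv_score, pipeline_str)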
Example
from tpot import TPOTClassifier
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target,
                                                    train_size=0.75, test_size=0.25)
tpot = TPOTClassifier(generations=5, population_size=50, verbosity=2)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))
tpot.export('tpot_digits_pipeline.py')
Functions
fit(features, classes[, sample_weight, groups]) | Run the TPOT optimization process on the given training data. |
predict(features) | Use the optimized pipeline to predict the classes for a feature set. |
predict_proba(features) | Use the optimized pipeline to estimate the class probabilities for a feature set. |
score(testing_features, testing_classes) | Returns the optimized pipeline's score on the given testing data using the user-specified scoring function. |
export(output_file_name) | Export the optimized pipeline as Python code. |
fit(features, classes, sample_weight=None, groups=None)
Uses genetic programming to optimize a machine learning pipeline that maximizes the score on the provided features and target. This pipeline optimization procedure uses internal k-fold cross-validation to avoid overfitting on the provided data. At the end of the pipeline optimization procedure, the best pipeline is then trained on the entire set of provided samples.
Parameters:
features: array-like {n_samples, n_features}
Feature matrix
classes: array-like {n_samples}
List of class labels for prediction
sample_weight: array-like {n_samples}, optional
Per-sample weights. Higher weights indicate more importance. If specified, sample_weight will be passed to any pipeline element whose fit() function accepts a sample_weight argument. By default, using sample_weight does not affect TPOT's scoring functions, which determine preferences between pipelines.
groups: array-like, with shape {n_samples, }, optional
Group labels for the samples used when performing cross-validation.
Returns:
self: object
Returns a copy of the fitted TPOT object
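The groups argument pairs naturally with a group-aware cross-validation strategy passed as cv. A minimal sketch, where X_train, y_train, and patient_ids are hypothetical training arrays:

from sklearn.model_selection import GroupKFold
from tpot import TPOTClassifier

# GroupKFold keeps all samples sharing a group label in the same CV fold,
# so pipelines are never scored on groups they were trained on.
tpot = TPOTClassifier(cv=GroupKFold(n_splits=5), generations=5, population_size=20)
tpot.fit(X_train, y_train, groups=patient_ids)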
predict(features)
Parameters:
features: array-like {n_samples, n_features}
Feature matrix
Returns:
predictions: array-like {n_samples}
Predicted classes for the samples in the feature matrix
predict_proba(features)
Note: This function will only work for pipelines whose final classifier supports the predict_proba function. TPOT will raise an error otherwise.
Parameters:
features: array-like {n_samples, n_features}
Feature matrix
Returns:
predictions: array-like {n_samples, n_classes}
The class probabilities of the input samples
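A short usage sketch, assuming a fitted tpot whose final classifier supports predict_proba:

proba = tpot.predict_proba(X_test)   # shape: (n_samples, n_classes)
print(proba[:5])                     # per-class probabilities for the first five samples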
score(testing_features, testing_classes)
The default scoring function for TPOTClassifier is 'accuracy'.
Parameters:
testing_features: array-like {n_samples, n_features}
Feature matrix of the testing set
testing_classes: array-like {n_samples}
List of class labels for prediction in the testing set
Returns:
accuracy_score: float
The estimated test set accuracy according to the user-specified scoring function.
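Because score() reuses whatever scoring function was given to the constructor, changing the scorer changes what this method reports. An illustrative sketch, reusing the train/test split from the example above:

tpot = TPOTClassifier(scoring='balanced_accuracy', generations=5, population_size=20)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))    # reports balanced accuracy, not plain accuracy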
export(output_file_name, data_file_path)
See the usage documentation for example usage of the export function.
Parameters:
output_file_name: string
String containing the path and file name of the desired output file.
data_file_path: string
By default, the data path placeholder in the exported code is 'PATH/TO/DATA/FILE'. If data_file_path is another string, that path is used instead.
Returns:
exported_code_string: string
The whole pipeline text as a string, returned only if output_file_name is not specified.
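For example, the exported script's data path placeholder can be overridden at export time (file names below are hypothetical):

tpot.export('tpot_pipeline.py', data_file_path='data/my_training_data.csv')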
Regression
class tpot.TPOTRegressor(generations=100, population_size=100, offspring_size=None, mutation_rate=0.9, crossover_rate=0.1, scoring='neg_mean_squared_error', cv=5, subsample=1.0, n_jobs=1, max_time_mins=None, max_eval_time_mins=5, random_state=None, config_dict=None, template=None, warm_start=False, memory=None, use_dask=False, periodic_checkpoint_folder=None, early_stop=None, verbosity=0, disable_update_check=False)
Automated machine learning for supervised regression tasks.
The TPOTRegressor performs an intelligent search over machine learning pipelines that can contain supervised regression models, preprocessors, feature selection techniques, and any other estimator or transformer that follows the scikit-learn API. The TPOTRegressor will also search over the hyperparameters of all objects in the pipeline.
By default, TPOTRegressor will search over a broad range of supervised regression models, transformers, and their hyperparameters. However, the models, transformers, and hyperparameters that TPOTRegressor searches over can be fully customized using the config_dict parameter.
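Besides a dictionary, config_dict also accepts the name of a built-in configuration as a string; for instance, the 'TPOT light' preset restricts the search to fast, simple operators:

from tpot import TPOTRegressor

# 'TPOT light' is one of TPOT's built-in configuration strings.
tpot = TPOTRegressor(config_dict='TPOT light')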
Read more in the User Guide.
Parameters:
generations: int or None, optional (default=100)
Number of iterations to run the pipeline optimization process. It must be a positive number or None. If None, the parameter max_time_mins must be defined as the runtime limit.
population_size: int, optional (default=100)
Number of individuals to retain in the genetic programming population every generation. Must be a positive number.
offspring_size: int, optional (default=None)
Number of offspring to produce in each genetic programming generation. Must be a positive number. By default, the number of offspring is equal to the population size.
mutation_rate: float, optional (default=0.9)
Mutation rate for the genetic programming algorithm in the range [0.0, 1.0]. This parameter tells the GP algorithm how many pipelines to apply random changes to every generation.
crossover_rate: float, optional (default=0.1)
Crossover rate for the genetic programming algorithm in the range [0.0, 1.0]. This parameter tells the genetic programming algorithm how many pipelines to "breed" every generation.
scoring: string or callable, optional (default='neg_mean_squared_error')
Function used to evaluate the quality of a given pipeline for the regression problem. Built-in scoring functions (following scikit-learn's scoring-string conventions) or a custom callable can be used.
cv: int, cross-validation generator, or an iterable, optional (default=5)
Cross-validation strategy used when evaluating pipelines.
subsample: float, optional (default=1.0)
Fraction of training samples that are used during the TPOT optimization process. Must be in the range (0.0, 1.0].
n_jobs: integer, optional (default=1)
Number of processes to use in parallel for evaluating pipelines during the TPOT optimization process.
max_time_mins: integer or None, optional (default=None)
How many minutes TPOT has to optimize the pipeline.
max_eval_time_mins: float, optional (default=5)
How many minutes TPOT has to evaluate a single pipeline.
random_state: integer or None, optional (default=None)
The seed of the pseudo-random number generator used in TPOT.
config_dict: Python dictionary, string, or None, optional (default=None)
A configuration dictionary for customizing the operators and parameters that TPOT searches over in the optimization process.
template: string, optional (default=None)
Template of a predefined pipeline structure, used to specify a desired structure for the machine learning pipelines evaluated by TPOT.
warm_start: boolean, optional (default=False)
Flag indicating whether the TPOT instance will reuse the population from previous calls to fit().
memory: a joblib.Memory object or string, optional (default=None)
If supplied, the pipeline will cache each transformer after calling fit. This feature is used to avoid recomputing fitted transformers within a pipeline when the parameters and input data are identical to those of another pipeline fitted during the optimization process. See the scikit-learn documentation for more details on memory caching.
use_dask: boolean, optional (default=False)
Whether to use Dask-ML's pipeline optimizations. This avoids re-fitting the same estimator on the same split of data multiple times. It will also provide more detailed diagnostics when using Dask's distributed scheduler.
periodic_checkpoint_folder: path string, optional (default=None)
If supplied, a folder in which TPOT will periodically save the Pareto-front pipelines found so far while optimizing.
early_stop: integer, optional (default=None)
Number of generations without improvement after which TPOT stops the optimization process.
verbosity: integer, optional (default=0)
How much information TPOT communicates while it is running.
disable_update_check: boolean, optional (default=False)
Flag indicating whether the TPOT version checker should be disabled.
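When use_dask=True, TPOT is typically paired with a dask.distributed client so pipeline evaluations can be distributed across workers. A minimal sketch, assuming dask and dask-ml are installed:

from dask.distributed import Client
from tpot import TPOTRegressor

client = Client()   # start (or connect to) a local Dask cluster
tpot = TPOTRegressor(use_dask=True, n_jobs=-1, generations=5, population_size=20)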
Attributes:
fitted_pipeline_: scikit-learn Pipeline object
The best pipeline that TPOT discovered during the pipeline optimization process, fitted on the entire training dataset.
pareto_front_fitted_pipelines_: Python dictionary
Dictionary containing all the pipelines on the TPOT Pareto front, where the key is the string representation of the pipeline and the value is the corresponding pipeline fitted on the entire training dataset.
evaluated_individuals_: Python dictionary
Dictionary containing all pipelines that were evaluated during the pipeline optimization process, where the key is the string representation of the pipeline and the value is a tuple containing (# of steps in the pipeline, score for the pipeline).
Example
from tpot import TPOTRegressor
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
housing = load_boston()
X_train, X_test, y_train, y_test = train_test_split(housing.data, housing.target,
                                                    train_size=0.75, test_size=0.25)
tpot = TPOTRegressor(generations=5, population_size=50, verbosity=2)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))
tpot.export('tpot_boston_pipeline.py')
Functions
fit(features, target[, sample_weight, groups]) | Run the TPOT optimization process on the given training data. |
predict(features) | Use the optimized pipeline to predict the target values for a feature set. |
score(testing_features, testing_target) | Returns the optimized pipeline's score on the given testing data using the user-specified scoring function. |
export(output_file_name) | Export the optimized pipeline as Python code. |
fit(features, target, sample_weight=None, groups=None)
Uses genetic programming to optimize a machine learning pipeline that maximizes the score on the provided features and target. This pipeline optimization procedure uses internal k-fold cross-validation to avoid overfitting on the provided data. At the end of the pipeline optimization procedure, the best pipeline is then trained on the entire set of provided samples.
Parameters:
features: array-like {n_samples, n_features}
Feature matrix
target: array-like {n_samples}
List of target labels for prediction
sample_weight: array-like {n_samples}, optional
Per-sample weights. Higher weights indicate more importance. If specified, sample_weight will be passed to any pipeline element whose fit() function accepts a sample_weight argument. By default, using sample_weight does not affect TPOT's scoring functions, which determine preferences between pipelines.
groups: array-like, with shape {n_samples, }, optional
Group labels for the samples used when performing cross-validation.
Returns:
self: object
Returns a copy of the fitted TPOT object
predict(features)
Parameters:
features: array-like {n_samples, n_features}
Feature matrix
Returns:
predictions: array-like {n_samples}
Predicted target values for the samples in the feature matrix
score(testing_features, testing_target)
The default scoring function for TPOTRegressor is 'neg_mean_squared_error'.
Parameters:
testing_features: array-like {n_samples, n_features}
Feature matrix of the testing set
testing_target: array-like {n_samples}
List of target labels for prediction in the testing set
Returns:
score: float
The estimated test set score according to the user-specified scoring function.
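Note that with the default 'neg_mean_squared_error' scorer, scores are negated so that higher is always better. A quick sketch, assuming a fitted tpot:

print(tpot.score(X_test, y_test))    # e.g. -12.3: the negated mean squared error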
export(output_file_name, data_file_path)
See the usage documentation for example usage of the export function.
Parameters:
output_file_name: string
String containing the path and file name of the desired output file.
data_file_path: string
By default, the data path placeholder in the exported code is 'PATH/TO/DATA/FILE'. If data_file_path is another string, that path is used instead.
Returns:
exported_code_string: string
The whole pipeline text as a string, returned only if output_file_name is not specified.