Intro¶
TPOT gives the user a lot of options for customizing the search space, from hyperparameter ranges to model selection to pipeline configuration. TPOT is able to select models, optimize their hyperparameters, and build a complex pipeline structure. Each level of detail has multiple customization options. This tutorial will first explore how to set up a hyperparameter search space for a single method. Next, we will describe how to set up simultaneous model selection and hyperparameter tuning. Finally, we will cover how to utilize these steps to configure a search space for a fixed pipeline of multiple steps, as well as having TPOT optimize the pipeline structure itself.
Hyperparameter Search Spaces with ConfigSpace¶
Hyperparameter search spaces are defined using the ConfigSpace package. More information on how to set up a hyperparameter space can be found in the ConfigSpace documentation.
TPOT uses ConfigSpace.ConfigurationSpace objects to define the hyperparameter search space for individual models. This object keeps track of the desired hyperparameters and provides functions for randomly sampling from the space.
In short, you can use the Integer, Float, and Categorical functions of ConfigSpace to define the range of values used for each parameter. Alternatively, a tuple of (min, max) ints or floats can be used to specify an int/float search space, and a list can be used to specify a categorical search space. A fixed value can also be provided for parameters that are not tuned. The space parameter of ConfigurationSpace takes in a dictionary mapping parameter names to these ranges.
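For example, a minimal sketch mixing all three shorthand styles alongside an explicit Float (the parameter names here are arbitrary and chosen only for illustration):
from ConfigSpace import ConfigurationSpace, Float
example_space = ConfigurationSpace(
space = {
'alpha': Float("alpha", bounds=(1e-4, 1.0), log=True), # explicit Float on a log scale
'max_iter': (100, 1000), # (min, max) tuple of ints -> integer range
'penalty': ['l1', 'l2'], # list -> categorical choice
'n_jobs': 1, # fixed value, not tuned
}
)
print(dict(example_space.sample_configuration()))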
Note: If you want reproducible results, you need to set a fixed random_state in the search space.
Here is an example of a hyperparameter search space for RandomForestClassifier:
from ConfigSpace import ConfigurationSpace, Integer, Float, Categorical, Normal
from sklearn.ensemble import RandomForestClassifier
import tpot2
import numpy as np
import sklearn
import sklearn.datasets
rf_configspace = ConfigurationSpace(
space = {
'n_estimators': 128, # fixed at 128, as recommended by Oshiro et al. (2012)
'max_features': Float("max_features", bounds=(0.01, 1), log=True), # log scale, as in auto-sklearn
'criterion': Categorical("criterion", ['gini', 'entropy']),
'min_samples_split': Integer("min_samples_split", bounds=(2, 20)),
'min_samples_leaf': Integer("min_samples_leaf", bounds=(1, 20)),
'bootstrap': Categorical("bootstrap", [True, False]),
#'random_state': 1, # If you want results to be reproducible, you can set a fixed random_state.
}
)
hyperparameters = dict(rf_configspace.sample_configuration())
print("sampled hyperparameters")
print(hyperparameters)
rf = RandomForestClassifier(**hyperparameters)
rf
sampled hyperparameters
{'bootstrap': False, 'criterion': 'entropy', 'max_features': 0.1574830347299, 'min_samples_leaf': 10, 'min_samples_split': 6, 'n_estimators': 128}
RandomForestClassifier(bootstrap=False, criterion='entropy', max_features=0.1574830347299, min_samples_leaf=10, min_samples_split=6, n_estimators=128)
More simply:
rf_configspace = ConfigurationSpace(
space = {
'n_estimators': 128, # fixed at 128, as recommended by Oshiro et al. (2012)
'max_features': (0.01, 1), # not log scaled
'criterion': ['gini', 'entropy'],
'min_samples_split': (2, 20),
'min_samples_leaf': (1, 20),
'bootstrap': [True, False],
#'random_state': 1, # If you want results to be reproducible, you can set a fixed random_state.
}
)
hyperparameters = dict(rf_configspace.sample_configuration())
print("sampled hyperparameters")
print(hyperparameters)
rf = RandomForestClassifier(**hyperparameters)
rf
sampled hyperparameters
{'bootstrap': True, 'criterion': 'entropy', 'max_features': 0.2601475241557, 'min_samples_leaf': 17, 'min_samples_split': 3, 'n_estimators': 128}
RandomForestClassifier(criterion='entropy', max_features=0.2601475241557, min_samples_leaf=17, min_samples_split=3, n_estimators=128)
TPOT Search Spaces¶
TPOT allows you to both define hyperparameter search spaces for individual methods and customize the structure of the pipelines it searches. For example, TPOT can build linear pipelines, trees, or graphs.
TPOT search spaces are found in the search_spaces module. There are two primary kinds of search spaces: node and pipeline. Node search spaces specify the search space for a single sklearn BaseEstimator. Pipeline search spaces define the possible structures for a group of node search spaces; they take in node search spaces and produce a pipeline built from nodes sampled from those spaces. Since sklearn Pipelines are themselves BaseEstimators, pipeline search spaces are also technically node search spaces, which means pipeline search spaces can take in other pipeline search spaces to define more complex structures. The primary differentiating factor is that pipeline search spaces must take in another search space as input to feed their individual nodes; therefore, all search spaces eventually bottom out in a node search space at the lowest level. Note that the parameters of pipeline search spaces differ: some take in a single search space, some take in a list, and some take in multiple defined parameters.
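For example, here is a sketch of that nesting, using helper functions and pipeline classes that are introduced later in this tutorial:
import tpot2
# A pipeline search space (SequentialPipeline) that takes another pipeline
# search space (a ChoicePipeline over classifiers) as one of its steps.
# Both eventually bottom out in EstimatorNode search spaces.
nested_space = tpot2.search_spaces.pipelines.SequentialPipeline([
tpot2.config.get_search_space("scalers"), # ChoicePipeline of scalers
tpot2.config.get_search_space("classifiers"), # ChoicePipeline of classifiers
])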
Node Search Spaces¶
Name | Info |
---|---|
EstimatorNode | Takes in a ConfigSpace along with the class of the method. This node will optimize the hyperparameters for a single method. |
GeneticFeatureSelectorNode | Uses evolution to optimize a set of features, exports a basic sklearn Selector that simply selects the features chosen by the node. |
FSSNode | FSS stands for FeatureSetSelector. This node takes in a list of user-defined subsets of features and selects a single predefined subset. Note that TPOT will not create new subsets, nor will it select multiple subsets per node. If using a linear pipeline, this node should be set as the first step, and it is recommended that you only use a small number of feature sets. FSSNode is most useful in pipelines that allow TPOT to select more than one FSSNode at a time; DynamicUnionPipeline and GraphPipeline are both excellent pairings. For example, use FSSNode inside a DynamicUnionPipeline at the start of a linear pipeline to explore optimal combinations of subsets, or set it as the leaf_search_space of a GraphSearchPipeline so TPOT can use multiple feature sets in different ways, for example with different transformers for different sets. |
Pipeline Search Spaces¶
These are found in the tpot2.search_spaces.pipelines module.
Name | Info |
---|---|
ChoicePipeline | Takes in a list of search spaces. Will select one node from the search space. |
SequentialPipeline | Takes in a list of search spaces. Will produce a pipeline of fixed length. Each step in the pipeline corresponds to the search space provided at the same index. |
DynamicLinearPipeline | Takes in a single search space. Will produce a linear pipeline of variable length. Each step in the pipeline will be pulled from the search space provided. |
UnionPipeline | Takes in a list of search spaces. The returned pipeline will include one estimator per search space joined in an sklearn FeatureUnion. Useful for having many steps in one layer. |
DynamicUnionPipeline | Takes in a single search space. It will pull anywhere from 1 to max_estimators estimators from the search space and concatenate them in a FeatureUnion. |
TreePipeline | Generates a pipeline of variable length. Pipeline will have a tree structure similar to TPOT1. |
GraphSearchPipeline | Generates a directed acyclic graph of variable size. Search spaces for root, leaf, and inner nodes can be defined separately if desired. |
WrapperPipeline | This search space is for wrapping a sklearn estimator with a method that takes another estimator and hyperparameters as arguments. For example, this can be used with sklearn.ensemble.BaggingClassifier or sklearn.ensemble.AdaBoostClassifier. |
Node Search Space Examples¶
Node search spaces represent the smallest unit of an sklearn pipeline. Each node search space creates and optimizes a single node, which exports a single estimator object. For example, this could be a KNeighborsClassifier or a FeatureSetSelector.
EstimatorNode¶
The EstimatorNode represents the hyperparameter search space for a scikit-learn estimator.
import tpot2
from ConfigSpace import ConfigurationSpace, Integer, Float, Categorical, Normal
from sklearn.neighbors import KNeighborsClassifier
knn_configspace = ConfigurationSpace(
space = {
'n_neighbors': Integer("n_neighbors", bounds=(1, 10)),
'weights': Categorical("weights", ['uniform', 'distance']),
'p': Integer("p", bounds=(1, 3)),
'metric': Categorical("metric", ['euclidean', 'minkowski']),
'n_jobs': 1,
}
)
knn_node = tpot2.search_spaces.nodes.EstimatorNode(
method = KNeighborsClassifier,
space = knn_configspace,
)
You can generate an individual with the generate() function. The individual holds a sample from the search space and provides mutation and crossover functions to modify that sample.
knn_individual = knn_node.generate()
knn_individual
<tpot2.search_spaces.nodes.estimator_node.EstimatorNodeIndividual at 0x78ec45f53430>
print("sampled hyperparameters")
print(knn_individual.hyperparameters)
sampled hyperparameters
{'metric': 'euclidean', 'n_jobs': 1, 'n_neighbors': 9, 'p': 1, 'weights': 'uniform'}
All Individual objects have mutation and crossover operators that TPOT uses to optimize the pipelines.
knn_individual.mutate() # mutate the individual
print("mutated hyperparameters")
print(knn_individual.hyperparameters)
mutated hyperparameters
{'metric': 'minkowski', 'n_jobs': 1, 'n_neighbors': 3, 'p': 3, 'weights': 'distance'}
In TPOT2, crossover only modifies the individual calling the crossover function; the second individual remains unchanged.
knn_individual1 = knn_node.generate()
knn_individual2 = knn_node.generate()
print("original hyperparameters for individual 1")
print(knn_individual1.hyperparameters)
print("original hyperparameters for individual 2")
print(knn_individual2.hyperparameters)
print()
knn_individual1.crossover(knn_individual2) # crossover the individuals
print("post crossover hyperparameters for individual 1")
print(knn_individual1.hyperparameters)
print("post crossover hyperparameters for individual 2")
print(knn_individual2.hyperparameters)
original hyperparameters for individual 1
{'metric': 'minkowski', 'n_jobs': 1, 'n_neighbors': 6, 'p': 2, 'weights': 'distance'}
original hyperparameters for individual 2
{'metric': 'euclidean', 'n_jobs': 1, 'n_neighbors': 4, 'p': 2, 'weights': 'uniform'}
post crossover hyperparameters for individual 1
{'metric': 'euclidean', 'n_jobs': 1, 'n_neighbors': 6, 'p': 2, 'weights': 'uniform'}
post crossover hyperparameters for individual 2
{'metric': 'euclidean', 'n_jobs': 1, 'n_neighbors': 4, 'p': 2, 'weights': 'uniform'}
All search spaces have an export_pipeline function that returns an sklearn BaseEstimator.
est = knn_individual1.export_pipeline()
est
KNeighborsClassifier(metric='euclidean', n_jobs=1, n_neighbors=6)
If a dictionary of parameters is passed instead of a ConfigSpace object, then the hyperparameters will always be fixed and not learned.
import tpot2
from ConfigSpace import ConfigurationSpace, Integer, Float, Categorical, Normal
from sklearn.neighbors import KNeighborsClassifier
space = {
'n_neighbors':10,
}
knn_node = tpot2.search_spaces.nodes.EstimatorNode(
method = KNeighborsClassifier,
space = space,
)
knn_node.generate().export_pipeline()
KNeighborsClassifier(n_neighbors=10)
FSSNode and GeneticFeatureSelectorNode¶
Both of these are given their own tutorials. See Tutorial 3 for FSSNode and Tutorial 5 for GeneticFeatureSelectorNode.
Pipeline Search Space Examples¶
Pipeline search spaces are used to define the structure and restrictions of the pipelines TPOT can search. Unlike node search spaces, all pipeline search spaces take in other search spaces as inputs. Rather than sampling hyperparameters, pipeline search spaces select models from the input search spaces and organize them within a linear sklearn Pipeline or a TPOT GraphPipeline.
ChoicePipeline¶
The simplest pipeline search space is the ChoicePipeline. This takes in a list of search spaces and simply selects and samples from one of them. In this example, we will construct a search space that takes in several options for a classifier. The resulting search space will first select a model from KNeighborsClassifier, LogisticRegression, or DecisionTreeClassifier, and then select the hyperparameters for the chosen model.
import tpot2
from ConfigSpace import ConfigurationSpace, Integer, Float, Categorical, Normal
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
knn_configspace = ConfigurationSpace(
space = {
'n_neighbors': Integer("n_neighbors", bounds=(1, 10)),
'weights': Categorical("weights", ['uniform', 'distance']),
'p': Integer("p", bounds=(1, 3)),
'metric': Categorical("metric", ['euclidean', 'minkowski']),
'n_jobs': 1,
}
)
lr_configspace = ConfigurationSpace(
space = {
'solver': Categorical("solver", ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']),
'penalty': Categorical("penalty", ['l1', 'l2']),
'dual': Categorical("dual", [True, False]),
'C': Float("C", bounds=(1e-4, 1e4), log=True),
'class_weight': Categorical("class_weight", ['balanced']),
'n_jobs': 1,
'max_iter': 1000,
}
)
dt_configspace = ConfigurationSpace(
space = {
'criterion': Categorical("criterion", ['gini', 'entropy']),
'max_depth': Integer("max_depth", bounds=(1, 11)),
'min_samples_split': Integer("min_samples_split", bounds=(2, 21)),
'min_samples_leaf': Integer("min_samples_leaf", bounds=(1, 21)),
'max_features': Categorical("max_features", ['sqrt', 'log2']),
'min_weight_fraction_leaf': 0.0,
}
)
knn_node = tpot2.search_spaces.nodes.EstimatorNode(
method = KNeighborsClassifier,
space = knn_configspace,
)
lr_node = tpot2.search_spaces.nodes.EstimatorNode(
method = LogisticRegression,
space = lr_configspace,
)
dt_node = tpot2.search_spaces.nodes.EstimatorNode(
method = DecisionTreeClassifier,
space = dt_configspace,
)
classifier_node = tpot2.search_spaces.pipelines.ChoicePipeline(
search_spaces=[
knn_node,
lr_node,
dt_node,
]
)
The same search space can also be constructed in a single expression:
tpot2.search_spaces.pipelines.ChoicePipeline(
search_spaces = [
tpot2.search_spaces.nodes.EstimatorNode(
method = KNeighborsClassifier,
space = knn_configspace,
),
tpot2.search_spaces.nodes.EstimatorNode(
method = LogisticRegression,
space = lr_configspace,
),
tpot2.search_spaces.nodes.EstimatorNode(
method = DecisionTreeClassifier,
space = dt_configspace,
),
]
)
<tpot2.search_spaces.pipelines.choice.ChoicePipeline at 0x78eb391763b0>
Individuals generated from pipeline search spaces work the same as those generated from node search spaces. Note that crossover only works when both individuals have sampled the same method.
classifier_individual = classifier_node.generate()
print("sampled pipeline")
classifier_individual.export_pipeline()
sampled pipeline
LogisticRegression(C=0.0008500633703, class_weight='balanced', max_iter=1000, n_jobs=1, penalty='l1', solver='saga')
print("mutated pipeline")
classifier_individual.mutate()
classifier_individual.export_pipeline()
mutated pipeline
LogisticRegression(C=0.1054489422979, class_weight='balanced', max_iter=1000, n_jobs=1, penalty='l1', solver='liblinear')
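Crossover works the same way here. As a sketch (if the two individuals sampled different methods, crossover may leave them unchanged):
ind1 = classifier_node.generate()
ind2 = classifier_node.generate()
print("individual 1 before crossover")
print(ind1.export_pipeline())
ind1.crossover(ind2) # only modifies ind1; ind2 stays the same
print("individual 1 after crossover")
print(ind1.export_pipeline())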
Built in search spaces for EstimatorNode and ChoicePipeline¶
TPOT2 also comes with predefined hyperparameter search spaces. The current search spaces were adapted from a combination of the original TPOT package as well as the search spaces used in AutoSklearn. The helper function tpot2.config.get_search_space
takes in a string or a list of strings, and returns either a EstimatorNode or a ChoicePipeline (including all methods in the list), respectively.
String | Corresponding Method |
---|---|
SGDClassifier | <class 'sklearn.linear_model._stochastic_gradient.SGDClassifier'> |
RandomForestClassifier | <class 'sklearn.ensemble._forest.RandomForestClassifier'> |
ExtraTreesClassifier | <class 'sklearn.ensemble._forest.ExtraTreesClassifier'> |
GradientBoostingClassifier | <class 'sklearn.ensemble._gb.GradientBoostingClassifier'> |
MLPClassifier | <class 'sklearn.neural_network._multilayer_perceptron.MLPClassifier'> |
DecisionTreeClassifier | <class 'sklearn.tree._classes.DecisionTreeClassifier'> |
XGBClassifier | <class 'xgboost.sklearn.XGBClassifier'> |
KNeighborsClassifier | <class 'sklearn.neighbors._classification.KNeighborsClassifier'> |
SVC | <class 'sklearn.svm._classes.SVC'> |
LogisticRegression | <class 'sklearn.linear_model._logistic.LogisticRegression'> |
LGBMClassifier | <class 'lightgbm.sklearn.LGBMClassifier'> |
LinearSVC | <class 'sklearn.svm._classes.LinearSVC'> |
GaussianNB | <class 'sklearn.naive_bayes.GaussianNB'> |
BernoulliNB | <class 'sklearn.naive_bayes.BernoulliNB'> |
MultinomialNB | <class 'sklearn.naive_bayes.MultinomialNB'> |
ExtraTreesRegressor | <class 'sklearn.ensemble._forest.ExtraTreesRegressor'> |
RandomForestRegressor | <class 'sklearn.ensemble._forest.RandomForestRegressor'> |
GradientBoostingRegressor | <class 'sklearn.ensemble._gb.GradientBoostingRegressor'> |
BaggingRegressor | <class 'sklearn.ensemble._bagging.BaggingRegressor'> |
DecisionTreeRegressor | <class 'sklearn.tree._classes.DecisionTreeRegressor'> |
KNeighborsRegressor | <class 'sklearn.neighbors._regression.KNeighborsRegressor'> |
XGBRegressor | <class 'xgboost.sklearn.XGBRegressor'> |
ZeroCount | <class 'tpot2.builtin_modules.zero_count.ZeroCount'> |
ColumnOneHotEncoder | <class 'tpot2.builtin_modules.column_one_hot_encoder.ColumnOneHotEncoder'> |
Binarizer | <class 'sklearn.preprocessing._data.Binarizer'> |
FastICA | <class 'sklearn.decomposition._fastica.FastICA'> |
FeatureAgglomeration | <class 'sklearn.cluster._agglomerative.FeatureAgglomeration'> |
MaxAbsScaler | <class 'sklearn.preprocessing._data.MaxAbsScaler'> |
MinMaxScaler | <class 'sklearn.preprocessing._data.MinMaxScaler'> |
Normalizer | <class 'sklearn.preprocessing._data.Normalizer'> |
Nystroem | <class 'sklearn.kernel_approximation.Nystroem'> |
PCA | <class 'sklearn.decomposition._pca.PCA'> |
PolynomialFeatures | <class 'sklearn.preprocessing._polynomial.PolynomialFeatures'> |
RBFSampler | <class 'sklearn.kernel_approximation.RBFSampler'> |
RobustScaler | <class 'sklearn.preprocessing._data.RobustScaler'> |
StandardScaler | <class 'sklearn.preprocessing._data.StandardScaler'> |
SelectFwe | <class 'sklearn.feature_selection._univariate_selection.SelectFwe'> |
SelectPercentile | <class 'sklearn.feature_selection._univariate_selection.SelectPercentile'> |
VarianceThreshold | <class 'sklearn.feature_selection._variance_threshold.VarianceThreshold'> |
SGDRegressor | <class 'sklearn.linear_model._stochastic_gradient.SGDRegressor'> |
Ridge | <class 'sklearn.linear_model._ridge.Ridge'> |
Lasso | <class 'sklearn.linear_model._coordinate_descent.Lasso'> |
ElasticNet | <class 'sklearn.linear_model._coordinate_descent.ElasticNet'> |
Lars | <class 'sklearn.linear_model._least_angle.Lars'> |
LassoLars | <class 'sklearn.linear_model._least_angle.LassoLars'> |
LassoLarsCV | <class 'sklearn.linear_model._least_angle.LassoLarsCV'> |
RidgeCV | <class 'sklearn.linear_model._ridge.RidgeCV'> |
SVR | <class 'sklearn.svm._classes.SVR'> |
LinearSVR | <class 'sklearn.svm._classes.LinearSVR'> |
AdaBoostRegressor | <class 'sklearn.ensemble._weight_boosting.AdaBoostRegressor'> |
ElasticNetCV | <class 'sklearn.linear_model._coordinate_descent.ElasticNetCV'> |
AdaBoostClassifier | <class 'sklearn.ensemble._weight_boosting.AdaBoostClassifier'> |
MLPRegressor | <class 'sklearn.neural_network._multilayer_perceptron.MLPRegressor'> |
GaussianProcessRegressor | <class 'sklearn.gaussian_process._gpr.GaussianProcessRegressor'> |
HistGradientBoostingClassifier | <class 'sklearn.ensemble._hist_gradient_boosting.gradient_boosting.HistGradientBoostingClassifier'> |
HistGradientBoostingRegressor | <class 'sklearn.ensemble._hist_gradient_boosting.gradient_boosting.HistGradientBoostingRegressor'> |
AddTransformer | <class 'tpot2.builtin_modules.arithmetictransformer.AddTransformer'> |
mul_neg_1_Transformer | <class 'tpot2.builtin_modules.arithmetictransformer.mul_neg_1_Transformer'> |
MulTransformer | <class 'tpot2.builtin_modules.arithmetictransformer.MulTransformer'> |
SafeReciprocalTransformer | <class 'tpot2.builtin_modules.arithmetictransformer.SafeReciprocalTransformer'> |
EQTransformer | <class 'tpot2.builtin_modules.arithmetictransformer.EQTransformer'> |
NETransformer | <class 'tpot2.builtin_modules.arithmetictransformer.NETransformer'> |
GETransformer | <class 'tpot2.builtin_modules.arithmetictransformer.GETransformer'> |
GTTransformer | <class 'tpot2.builtin_modules.arithmetictransformer.GTTransformer'> |
LETransformer | <class 'tpot2.builtin_modules.arithmetictransformer.LETransformer'> |
LTTransformer | <class 'tpot2.builtin_modules.arithmetictransformer.LTTransformer'> |
MinTransformer | <class 'tpot2.builtin_modules.arithmetictransformer.MinTransformer'> |
MaxTransformer | <class 'tpot2.builtin_modules.arithmetictransformer.MaxTransformer'> |
ZeroTransformer | <class 'tpot2.builtin_modules.arithmetictransformer.ZeroTransformer'> |
OneTransformer | <class 'tpot2.builtin_modules.arithmetictransformer.OneTransformer'> |
NTransformer | <class 'tpot2.builtin_modules.arithmetictransformer.NTransformer'> |
PowerTransformer | <class 'sklearn.preprocessing._data.PowerTransformer'> |
QuantileTransformer | <class 'sklearn.preprocessing._data.QuantileTransformer'> |
ARDRegression | <class 'sklearn.linear_model._bayes.ARDRegression'> |
QuadraticDiscriminantAnalysis | <class 'sklearn.discriminant_analysis.QuadraticDiscriminantAnalysis'> |
PassiveAggressiveClassifier | <class 'sklearn.linear_model._passive_aggressive.PassiveAggressiveClassifier'> |
LinearDiscriminantAnalysis | <class 'sklearn.discriminant_analysis.LinearDiscriminantAnalysis'> |
DominantEncoder | <class 'tpot2.builtin_modules.genetic_encoders.DominantEncoder'> |
RecessiveEncoder | <class 'tpot2.builtin_modules.genetic_encoders.RecessiveEncoder'> |
HeterosisEncoder | <class 'tpot2.builtin_modules.genetic_encoders.HeterosisEncoder'> |
UnderDominanceEncoder | <class 'tpot2.builtin_modules.genetic_encoders.UnderDominanceEncoder'> |
OverDominanceEncoder | <class 'tpot2.builtin_modules.genetic_encoders.OverDominanceEncoder'> |
GaussianProcessClassifier | <class 'sklearn.gaussian_process._gpc.GaussianProcessClassifier'> |
BaggingClassifier | <class 'sklearn.ensemble._bagging.BaggingClassifier'> |
LGBMRegressor | <class 'lightgbm.sklearn.LGBMRegressor'> |
Passthrough | <class 'tpot2.builtin_modules.passthrough.Passthrough'> |
SkipTransformer | <class 'tpot2.builtin_modules.passthrough.SkipTransformer'> |
PassKBinsDiscretizer | <class 'tpot2.builtin_modules.passkbinsdiscretizer.PassKBinsDiscretizer'> |
SimpleImputer | <class 'sklearn.impute._base.SimpleImputer'> |
IterativeImputer | <class 'sklearn.impute._iterative.IterativeImputer'> |
KNNImputer | <class 'sklearn.impute._knn.KNNImputer'> |
MDR | <class 'mdr.mdr.MDR'> |
ContinuousMDR | <class 'mdr.continuous_mdr.ContinuousMDR'> |
ReliefF | <class 'skrebate.relieff.ReliefF'> |
SURF | <class 'skrebate.surf.SURF'> |
SURFstar | <class 'skrebate.surfstar.SURFstar'> |
MultiSURF | <class 'skrebate.multisurf.MultiSURF'> |
LinearRegression_sklearnex | <class 'sklearnex.linear_model.linear.LinearRegression'> |
Ridge_sklearnex | <class 'daal4py.sklearn.linear_model._ridge.Ridge'> |
Lasso_sklearnex | <class 'daal4py.sklearn.linear_model._coordinate_descent.Lasso'> |
ElasticNet_sklearnex | <class 'daal4py.sklearn.linear_model._coordinate_descent.ElasticNet'> |
SVR_sklearnex | <class 'sklearnex.svm.svr.SVR'> |
NuSVR_sklearnex | <class 'sklearnex.svm.nusvr.NuSVR'> |
RandomForestRegressor_sklearnex | <class 'sklearnex.ensemble._forest.RandomForestRegressor'> |
KNeighborsRegressor_sklearnex | <class 'sklearnex.neighbors.knn_regression.KNeighborsRegressor'> |
RandomForestClassifier_sklearnex | <class 'sklearnex.ensemble._forest.RandomForestClassifier'> |
KNeighborsClassifier_sklearnex | <class 'sklearnex.neighbors.knn_classification.KNeighborsClassifier'> |
SVC_sklearnex | <class 'sklearnex.svm.svc.SVC'> |
NuSVC_sklearnex | <class 'sklearnex.svm.nusvc.NuSVC'> |
LogisticRegression_sklearnex | <class 'sklearnex.linear_model.logistic_regression.LogisticRegression'> |
Some methods require a wrapped estimator. To account for both regression and classification, these have been grouped separately with their own special strings.
Wrapper Special String | Notes |
---|---|
RFE_classification | RFE with learned ExtraTreesClassifier |
RFE_regression | RFE with learned ExtraTreesRegressor |
SelectFromModel_classification | SelectFromModel with learned ExtraTreesClassifier |
SelectFromModel_regression | SelectFromModel with learned ExtraTreesRegressor |
IterativeImputer_learned_estimators | IterativeImputer with learned ExtraTreesRegressor |
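These wrapper special strings can be passed to get_search_space like any other string; for example, a sketch:
import tpot2
# Per the table above, this returns a search space for SelectFromModel
# wrapped around a learned ExtraTreesClassifier; the selector's and the
# inner estimator's hyperparameters are tuned together.
sfm_space = tpot2.config.get_search_space("SelectFromModel_classification")
sfm_space.generate().export_pipeline()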
There are also special strings that correspond to predefined lists of methods. These will return a ChoicePipeline over the included methods.
List Special String | Included methods |
---|---|
"selectors" | ["SelectFwe", "SelectPercentile", "VarianceThreshold",] |
"selectors_classification" | ["SelectFwe", "SelectPercentile", "VarianceThreshold", "RFE_classification", "SelectFromModel_classification"] |
"selectors_regression" | ["SelectFwe", "SelectPercentile", "VarianceThreshold", "RFE_regression", "SelectFromModel_regression"] |
"classifiers" | ["LGBMClassifier", "BaggingClassifier", 'AdaBoostClassifier', 'BernoulliNB', 'DecisionTreeClassifier', 'ExtraTreesClassifier', 'GaussianNB', 'HistGradientBoostingClassifier', 'KNeighborsClassifier','LinearDiscriminantAnalysis', 'LogisticRegression', "LinearSVC", "SVC", 'MLPClassifier', 'MultinomialNB', "QuadraticDiscriminantAnalysis", 'RandomForestClassifier', 'SGDClassifier', 'XGBClassifier'] |
"regressors" | ["LGBMRegressor", 'AdaBoostRegressor', "ARDRegression", 'DecisionTreeRegressor', 'ExtraTreesRegressor', 'HistGradientBoostingRegressor', 'KNeighborsRegressor', 'LinearSVR', "MLPRegressor", 'RandomForestRegressor', 'SGDRegressor', 'SVR', 'XGBRegressor'] |
"transformers" | ["PassKBinsDiscretizer", "Binarizer", "PCA", "ZeroCount", "ColumnOneHotEncoder", "FastICA", "FeatureAgglomeration", "Nystroem", "RBFSampler", "QuantileTransformer", "PowerTransformer"] |
"scalers" | ["MinMaxScaler", "RobustScaler", "StandardScaler", "MaxAbsScaler", "Normalizer", ] |
"all_transformers" | ["transformers", "scalers"] |
"arithmatic" | ["AddTransformer", "mul_neg_1_Transformer", "MulTransformer", "SafeReciprocalTransformer", "EQTransformer", "NETransformer", "GETransformer", "GTTransformer", "LETransformer", "LTTransformer", "MinTransformer", "MaxTransformer"] |
"imputers" | ["SimpleImputer", "IterativeImputer", "KNNImputer"] |
"skrebate" | ["ReliefF", "SURF", "SURFstar", "MultiSURF"] |
"genetic_encoders" | ["DominantEncoder", "RecessiveEncoder", "HeterosisEncoder", "UnderDominanceEncoder", "OverDominanceEncoder"] |
"classifiers_sklearnex" | ["RandomForestClassifier_sklearnex", "LogisticRegression_sklearnex", "KNeighborsClassifier_sklearnex", "SVC_sklearnex","NuSVC_sklearnex"] |
"regressors_sklearnex" | ["LinearRegression_sklearnex", "Ridge_sklearnex", "Lasso_sklearnex", "ElasticNet_sklearnex", "SVR_sklearnex", "NuSVR_sklearnex", "RandomForestRegressor_sklearnex", "KNeighborsRegressor_sklearnex"] |
"genetic encoders" | ["DominantEncoder", "RecessiveEncoder", "HeterosisEncoder", "UnderDominanceEncoder", "OverDominanceEncoder"] |
Here are some examples of how to get search spaces using the get_search_space function.
#same pipeline search space as before.
classifier_choice = tpot2.config.get_search_space(["KNeighborsClassifier", "LogisticRegression", "DecisionTreeClassifier"])
print("sampled pipeline 1")
classifier_choice.generate().export_pipeline()
sampled pipeline 1
KNeighborsClassifier(n_jobs=1, n_neighbors=55, weights='distance')
print("sampled pipeline 2")
classifier_choice.generate().export_pipeline()
sampled pipeline 2
LogisticRegression(C=0.012915602763, l1_ratio=0.2577823332886, max_iter=1000, n_jobs=1, penalty='elasticnet', solver='saga')
#search space for all classifiers
classifier_choice = tpot2.config.get_search_space("classifiers")
print("sampled pipeline 1")
classifier_choice.generate().export_pipeline()
sampled pipeline 1
SGDClassifier(alpha=0.0038384092036, class_weight='balanced', eta0=0.7197535254246, l1_ratio=0.8816063677431, loss='modified_huber', n_jobs=1, penalty='elasticnet')
print("sampled pipeline 2")
classifier_choice.generate().export_pipeline()
sampled pipeline 2
KNeighborsClassifier(n_jobs=1, n_neighbors=1, p=1, weights='distance')
A note on reproducibility¶
Many sklearn estimators, like RandomForestClassifier, are stochastic and require a random_state parameter in order to have deterministic results. If you want TPOT runs to be reproducible, it is important that the estimators used by TPOT have a random state set. TPOT will not automatically set this value. This can either be set manually in each search space, or by passing in the random state to the get_search_space
function. For example:
reproducible_random_forest = tpot2.config.get_search_space("RandomForestClassifier", random_state=1)
reproducible_random_forest.generate().export_pipeline()
RandomForestClassifier(bootstrap=False, criterion='entropy', max_features=0.0121463021153, min_samples_leaf=10, min_samples_split=14, n_estimators=128, random_state=1)
SequentialPipeline¶
SequentialPipelines are of fixed length; each step is sampled from the search space provided at the corresponding index.
selector_choicepipeline = tpot2.config.get_search_space("VarianceThreshold")
transformer_choicepipeline = tpot2.config.get_search_space("PCA")
classifier_choicepipeline = tpot2.config.get_search_space("LogisticRegression")
stc_pipeline = tpot2.search_spaces.pipelines.SequentialPipeline([
selector_choicepipeline,
transformer_choicepipeline,
classifier_choicepipeline,
])
print("sampled pipeline")
stc_pipeline.generate().export_pipeline()
sampled pipeline
Pipeline(steps=[('variancethreshold', VarianceThreshold(threshold=0.0008293708451)), ('pca', PCA(n_components=0.5048643890372)), ('logisticregression', LogisticRegression(C=7.7606337566295, class_weight='balanced', l1_ratio=0.123465163557, max_iter=1000, n_jobs=1, penalty='elasticnet', solver='saga'))])
Here is an example of the form Selector-Transformer-Classifier. Note that each step in the sequence is a ChoicePipeline this time. The SequentialPipeline samples from each provided search space in order.
selector_choicepipeline = tpot2.config.get_search_space("selectors")
transformer_choicepipeline = tpot2.config.get_search_space("transformers")
classifier_choicepipeline = tpot2.config.get_search_space("classifiers")
stc_pipeline = tpot2.search_spaces.pipelines.SequentialPipeline([
selector_choicepipeline,
transformer_choicepipeline,
classifier_choicepipeline,
])
print("sampled pipeline")
stc_pipeline.generate().export_pipeline()
sampled pipeline
Pipeline(steps=[('variancethreshold', VarianceThreshold(threshold=0.1215210592814)), ('fastica', FastICA(n_components=83)), ('baggingclassifier', BaggingClassifier(bootstrap_features=True, max_features=0.9057563115025, max_samples=0.2313759070451, n_estimators=89, n_jobs=1))])
print("sampled pipeline")
stc_pipeline.generate().export_pipeline()
sampled pipeline
Pipeline(steps=[('selectpercentile', SelectPercentile(percentile=25.1697450346144)), ('kbinsdiscretizer', KBinsDiscretizer(encode='onehot-dense', n_bins=40, strategy='uniform')), ('lineardiscriminantanalysis', LinearDiscriminantAnalysis(shrinkage=0.755769834898, solver='eigen'))])
DynamicLinearPipeline¶
DynamicLinearPipeline takes in a single search space and builds a linear pipeline of variable length, sampling each step from that search space without a predefined sequence. It is most often used as one step inside a larger linear pipeline (for example, a SequentialPipeline, as below). A common strategy is to use DynamicLinearPipeline to optimize a series of preprocessing or feature engineering steps, followed by a final classifier or regressor.
import tpot2.config
linear_feature_engineering = tpot2.search_spaces.pipelines.DynamicLinearPipeline(search_space = tpot2.config.get_search_space(["all_transformers","selectors_classification"]), max_length=10)
print("sampled pipeline")
linear_feature_engineering.generate().export_pipeline()
sampled pipeline
Pipeline(steps=[('rbfsampler', RBFSampler(gamma=0.1991726671256, n_components=7)), ('zerocount', ZeroCount()), ('binarizer', Binarizer(threshold=0.5354245073766))])
print("sampled pipeline")
linear_feature_engineering.generate().export_pipeline()
sampled pipeline
Pipeline(steps=[('selectfwe', SelectFwe(alpha=0.0014251225737)), ('powertransformer', PowerTransformer())])
full_search_space = tpot2.search_spaces.pipelines.SequentialPipeline([
linear_feature_engineering,
tpot2.config.get_search_space("classifiers"),
])
print("sampled pipeline")
full_search_space.generate().export_pipeline()
sampled pipeline
Pipeline(steps=[('pipeline', Pipeline(steps=[('nystroem', Nystroem(gamma=0.3480554902065, kernel='sigmoid', n_components=20)), ('binarizer', Binarizer(threshold=0.6696149189758)), ('minmaxscaler', MinMaxScaler())])), ('multinomialnb', MultinomialNB(alpha=0.0016967794962))])
print("sampled pipeline")
full_search_space.generate().export_pipeline()
sampled pipeline
Pipeline(steps=[('pipeline', Pipeline(steps=[('zerocount', ZeroCount()), ('variancethreshold', VarianceThreshold(threshold=0.0020422211173)), ('binarizer', Binarizer(threshold=0.9681763702))])), ('bernoullinb', BernoulliNB(alpha=0.0816524714629, fit_prior=False))])
UnionPipeline¶
Union pipelines are useful when you want to apply multiple transformations in a single layer. Another common strategy is to union a transformer with a passthrough, for when you want to keep the original data in addition to the transformed features.
transform_and_passthrough = tpot2.search_spaces.pipelines.UnionPipeline([
tpot2.config.get_search_space("transformers"),
tpot2.config.get_search_space("Passthrough"),
])
transform_and_passthrough.generate().export_pipeline()
FeatureUnion(transformer_list=[('fastica', FastICA(algorithm='deflation', n_components=66)), ('passthrough', Passthrough())])
UnionPipelines are an excellent tool to expand the capabilities of the linear search spaces.
stc_pipeline2 = tpot2.search_spaces.pipelines.SequentialPipeline([
tpot2.config.get_search_space("selectors"),
transform_and_passthrough,
tpot2.config.get_search_space("classifiers"),
])
stc_pipeline2.generate().export_pipeline()
Pipeline(steps=[('variancethreshold', VarianceThreshold(threshold=0.0009494718313)), ('featureunion', FeatureUnion(transformer_list=[('binarizer', Binarizer(threshold=0.8136655878085)), ('passthrough', Passthrough())])), ('adaboostclassifier', AdaBoostClassifier(learning_rate=0.1727096029044, n_estimators=446))])
Union pipelines can also be used to create "branches" if you are trying to create a tree-like search space. This can be particularly useful when paired with the FeatureSetSelector node (FSSNode), since each branch can, for example, learn different feature engineering for different subsets of the features.
st_pipeline = tpot2.search_spaces.pipelines.SequentialPipeline([
tpot2.config.get_search_space("selectors"),
tpot2.config.get_search_space("transformers"),
])
branched_pipeline = tpot2.search_spaces.pipelines.SequentialPipeline([
tpot2.search_spaces.pipelines.UnionPipeline([
st_pipeline,
st_pipeline,
]),
tpot2.config.get_search_space("classifiers"),
])
branched_pipeline.generate().export_pipeline()
Pipeline(steps=[('featureunion', FeatureUnion(transformer_list=[('pipeline-1', Pipeline(steps=[('variancethreshold', VarianceThreshold(threshold=0.1996640297479)), ('powertransformer', PowerTransformer())])), ('pipeline-2', Pipeline(steps=[('selectfwe', SelectFwe(alpha=0.0045323854667)), ('fastica', FastICA(n_components=34))]))])), ('quadraticdiscriminantanalysis', QuadraticDiscriminantAnalysis(reg_param=0.8833282196313))])
DynamicUnionPipeline¶
DynamicUnionPipeline works similarly to UnionPipeline. Whereas UnionPipeline is of fixed length, with each index corresponding to a search space in the provided list, DynamicUnionPipeline takes in a single search space, samples one or more estimators/pipelines from it, and concatenates them with a FeatureUnion.
Note that DynamicUnionPipeline will check for pipeline uniqueness, so it will never concatenate two completely identical pipelines. In other words, all steps within the feature union will be unique.
This can be useful when you want multiple transformers (or in some cases, pipelines), but are not sure how many or which ones.
dynamic_transformers = tpot2.search_spaces.pipelines.DynamicUnionPipeline(tpot2.config.get_search_space("transformers"), max_estimators=4)
dynamic_transformers.generate().export_pipeline()
FeatureUnion(transformer_list=[('zerocount', ZeroCount()), ('powertransformer', PowerTransformer())])
One good strategy could be to pair this with Passthrough in a feature union so that you output all the transformations along with the original data.
dynamic_transformers_with_passthrough = tpot2.search_spaces.pipelines.UnionPipeline([
dynamic_transformers,
tpot2.config.get_search_space("Passthrough")],
)
dynamic_transformers_with_passthrough.generate().export_pipeline()
FeatureUnion(transformer_list=[('featureunion', FeatureUnion(transformer_list=[('powertransformer', PowerTransformer())])), ('passthrough', Passthrough())])
stc_pipeline3 = tpot2.search_spaces.pipelines.SequentialPipeline([
tpot2.config.get_search_space("selectors"),
dynamic_transformers_with_passthrough,
tpot2.config.get_search_space("classifiers"),
])
stc_pipeline3.generate().export_pipeline()
Pipeline(steps=[('selectpercentile', SelectPercentile(percentile=3.5688237635159)), ('featureunion', FeatureUnion(transformer_list=[('featureunion', FeatureUnion(transformer_list=[('featureagglomeration', FeatureAgglomeration(n_clusters=28, pooling_func=<function max at 0x78ec455b4e30>))])), ('passthrough', Passthrough())])), ('logisticregression', LogisticRegression(C=9762.07332929782, max_iter=1000, n_jobs=1, solver='saga'))])
WrapperPipeline¶
Some sklearn estimators take in other sklearn estimators as a parameter. The WrapperPipeline is used to tune the outer estimator's hyperparameters simultaneously with the inner estimator's hyperparameters. In fact, the inner estimator in a WrapperPipeline can be any search space defined with any of the methods described in this tutorial.
The get_search_space function will automatically create an inner search space for sklearn estimators that require an inner estimator. For example, "SelectFromModel_classification" will return the following search space:
SelectFromModel_configspace_part = ConfigurationSpace(
space = {
'threshold': Float('threshold', bounds=(1e-4, 1.0), log=True),
}
)
extratrees_estimator_node = tpot2.config.get_search_space("ExtraTreesClassifier") # an EstimatorNode search space for ExtraTreesClassifier
extratrees_estimator_node.generate().export_pipeline()
ExtraTreesClassifier(class_weight='balanced', max_features=0.6642237575313, min_samples_leaf=17, min_samples_split=3, n_jobs=1)
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel
select_from_model_wrapper_searchspace = tpot2.search_spaces.pipelines.WrapperPipeline(
method=SelectFromModel,
space = SelectFromModel_configspace_part,
estimator_search_space= extratrees_estimator_node,
)
select_from_model_wrapper_searchspace.generate().export_pipeline()
SelectFromModel(estimator=ExtraTreesClassifier(bootstrap=True, class_weight='balanced', max_features=0.3007313724684, min_samples_leaf=12, min_samples_split=17, n_jobs=1), threshold=0.0048046738992)
WrapperPipeline strategy for ensembles/inner classifiers and regressors (EstimatorTransformer)¶
Sklearn Pipelines only allow classifiers/regressors as the final step; all other steps are expected to implement a transform function. We can get around this by wrapping the inner classifier/regressor in a transformer class that returns the output of predict or predict_proba from its transform() function.
To wrap classifiers as transformers, you can use the tpot2.builtin_modules.EstimatorTransformer class. You can specify whether to pass along the outputs of predict, predict_proba, or decision_function with the method parameter.
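For example, a minimal sketch wrapping a single fixed classifier (the choice of LogisticRegression here is illustrative):
from sklearn.linear_model import LogisticRegression
import tpot2
# transform() on the wrapped object returns the classifier's
# predict_proba output, so it can act as a feature-generating
# step in the middle of a pipeline.
clf_as_transformer = tpot2.builtin_modules.EstimatorTransformer(
estimator=LogisticRegression(),
method='predict_proba',
)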
cross_val_predict_cv¶
An additional consideration is whether or not to use cross_val_predict_cv. If this parameter is set, during model training any classifier or regressor that is not the final predictor will use sklearn.model_selection.cross_val_predict to pass out-of-sample predictions into the following steps of the model. The model will still be fit to the full data, which will be used for predictions after training. Training downstream models on out-of-sample predictions can often prevent overfitting and increase performance, because it gives downstream models an estimate of how upstream models perform on unseen data. Otherwise, if an upstream model heavily overfits the data, downstream models may simply learn to blindly trust the seemingly well-predicting model, propagating the overfitting through to the end result.
The downside is that cross_val_predict_cv is significantly more computationally demanding, and may not be necessary for your given dataset.
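To see the mechanism outside of TPOT, here is a sketch using scikit-learn directly:
import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.ensemble import RandomForestClassifier
X, y = np.random.rand(100, 10), np.random.randint(0, 2, 100)
# Each sample's prediction comes from a fold whose model never saw that
# sample, giving downstream models an honest estimate of upstream
# performance rather than optimistic in-sample predictions.
out_of_sample_preds = cross_val_predict(RandomForestClassifier(), X, y, cv=10)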
Note: This is not necessary for GraphSearchPipeline, as the exported GraphPipeline estimator has built-in support for inner classifiers/regressors. Instead of using a wrapper, you can set the cross_val_predict_cv param when initializing the GraphSearchPipeline object.
classifiers = tpot2.config.get_search_space("classifiers")
wrapped_estimators = tpot2.search_spaces.pipelines.WrapperPipeline(tpot2.builtin_modules.EstimatorTransformer, {}, classifiers)
est = wrapped_estimators.generate().export_pipeline() #returns an estimator with a transform function
est
EstimatorTransformer(estimator=SVC(C=140.9223338924506, gamma=0.0007253447995, max_iter=3000, probability=True, shrinking=False))
import numpy as np
X, y = np.random.rand(100, 10), np.random.randint(0, 2, 100)
est.fit_transform(X, y)[0:5]
array([[0.5       , 0.5       ],
       [0.50964815, 0.49035185],
       [0.50681558, 0.49318442],
       [0.51565809, 0.48434191],
       [0.52006004, 0.47993996]])
You can manually set fixed parameters for the wrapper the same way you would for an EstimatorNode. Here is another example, this time with cross_val_predict_cv and method being used.
classifiers = tpot2.config.get_search_space("classifiers")
wrapped_estimators_cv = tpot2.search_spaces.pipelines.WrapperPipeline(tpot2.builtin_modules.EstimatorTransformer, {'cross_val_predict_cv':10, 'method':'predict'}, classifiers)
est = wrapped_estimators_cv.generate().export_pipeline() #returns an estimator with a transform function
est.fit_transform(X, y)[0:5]
array([[0],
       [0],
       [1],
       [1],
       [1]])
These can now be used inside a linear pipeline. This is fairly similar to the default linear pipeline search space.
dynamic_wrapped_classifiers_with_passthrough = tpot2.search_spaces.pipelines.UnionPipeline([
tpot2.search_spaces.pipelines.DynamicUnionPipeline(wrapped_estimators_cv, max_estimators=4),
tpot2.config.get_search_space("Passthrough")
])
stc_pipeline4 = tpot2.search_spaces.pipelines.SequentialPipeline([
tpot2.config.get_search_space("scalers"),
dynamic_transformers_with_passthrough,
dynamic_wrapped_classifiers_with_passthrough,
tpot2.config.get_search_space("classifiers"),
])
stc_pipeline4.generate().export_pipeline()
Pipeline(steps=[('normalizer', Normalizer(norm='max')), ('featureunion-1', FeatureUnion(transformer_list=[('featureunion', FeatureUnion(transformer_list=[('rbfsampler', RBFSampler(gamma=0.7809991844556, n_components=50)), ('columnonehotencoder', ColumnOneHotEncoder()), ('nystroem', Nystroem(gamma=0.3179172515929, kernel='additive_chi2', n_components=80))])), ('... class_weight='balanced', eta0=0.4039854095517, l1_ratio=0.0336982783886, learning_rate='constant', loss='modified_huber', n_jobs=1, penalty='elasticnet'), method='predict'))])), ('passthrough', Passthrough())])), ('mlpclassifier', MLPClassifier(alpha=0.0867902302825, hidden_layer_sizes=[35], learning_rate='invscaling', learning_rate_init=0.0152961651727, n_iter_no_change=32))])
GraphSearchPipeline¶
The GraphSearchPipeline is a flexible search space with no prior restriction on pipeline structure. With GraphSearchPipeline, TPOT will create a pipeline in the shape of a directed acyclic graph. Throughout the optimization process, TPOT may add/remove nodes, add/remove edges, and perform model selection and hyperparameter tuning for each node.
The primary parameters for the graph_search_space are the root_search_space, inner_search_space, and leaf_search_space.
Parameter | Type | Description |
---|---|---|
root_search_space | SklearnIndividualGenerator | The search space for the root node of the graph. This node will be the final estimator in the pipeline. |
inner_search_space | SklearnIndividualGenerator, optional | The search space for the inner nodes of the graph. If not defined, there will be no inner nodes. |
leaf_search_space | SklearnIndividualGenerator, optional | The search space for the leaf nodes of the graph. If not defined, the leaf nodes will be drawn from the inner_search_space. |
crossover_same_depth | bool, optional | If True, crossover will only occur between nodes at the same depth in the graph. If False, crossover will occur between nodes at any depth. |
cross_val_predict_cv | int, cross-validation generator or an iterable, optional | Determines the cross-validation splitting strategy used in inner classifiers or regressors. |
method | str, optional | The prediction method to use for the inner classifiers or regressors. If 'auto', it will try to use predict_proba, decision_function, or predict in that order. |
This search space exports a tpot2.GraphPipeline. This is similar to a scikit-learn Pipeline, but for directed acyclic graph pipelines. You can learn more about using this module in Tutorial 6.
graph_search_space = tpot2.search_spaces.pipelines.GraphSearchPipeline(
root_search_space= tpot2.config.get_search_space(["KNeighborsClassifier", "LogisticRegression", "DecisionTreeClassifier"]),
leaf_search_space = tpot2.config.get_search_space("selectors"),
inner_search_space = tpot2.config.get_search_space(["transformers"]),
max_size = 10,
)
ind = graph_search_space.generate()
est1 = ind.export_pipeline()
est1.plot() #GraphPipelines have a helpful plotting function to visualize the pipeline
Let's apply a few more mutations and plot the resulting pipelines to get a sense of the diversity of pipelines that can be generated with this search space.
for i in range(0,50):
ind.mutate()
if i%5==0:
est = ind.export_pipeline()
est.plot()
TreePipeline¶
TreePipelines work the same way as GraphPipelines, but they are limited to a tree structure. This is similar to the search space in the original TPOT.
(This search space is still experimental and currently built off GraphSearchPipeline. It may be rewritten with its own code in the future.)
tree_search_space = tpot2.search_spaces.pipelines.TreePipeline(
root_search_space= tpot2.config.get_search_space(["KNeighborsClassifier", "LogisticRegression", "DecisionTreeClassifier"]),
leaf_search_space = tpot2.config.get_search_space("selectors"),
inner_search_space = tpot2.config.get_search_space(["transformers"]),
max_size = 10,
)
ind = tree_search_space.generate()
exp = ind.export_pipeline()
exp.plot()
Tips and Tricks¶
- Two very helpful transformers to use with search spaces are tpot2.builtin_modules.Passthrough and tpot2.builtin_modules.SkipTransformer. Passthrough simply passes the exact inputs it receives through to the next step. This is particularly useful inside a UnionPipeline, as it allows both the transformed data and the original data to be passed into the next step. SkipTransformer always returns nothing. This is helpful inside a union with Passthrough and an optional second method. For example, if you are unsure whether you will need a transformer, you can have SkipTransformer be one option, which skips the transformation step if selected.
In this example, the FeatureUnion layer will always have at least one transformer selected and will always have one Passthrough.
from tpot2.search_spaces.pipelines import *
from tpot2.config import get_search_space
#This FeatureUnion layer will always have at least one transformer selected and will always have one passthrough
transformers_with_passthrough = UnionPipeline([
    DynamicUnionPipeline(get_search_space(["transformers"])),
    get_search_space("Passthrough")
])
transformers_with_passthrough.generate().export_pipeline()
FeatureUnion(transformer_list=[('featureunion', FeatureUnion(transformer_list=[('kbinsdiscretizer', KBinsDiscretizer(encode='onehot-dense', n_bins=9, strategy='uniform')), ('quantiletransformer', QuantileTransformer(n_quantiles=697))])), ('passthrough', Passthrough())])
In this example, the FeatureUnion layer will always include one Passthrough. In addition, it may select one or more transformers, or it may skip transformers altogether and include only the Passthrough.
final_transformers_layer = UnionPipeline([
    ChoicePipeline([
        DynamicUnionPipeline(get_search_space(["transformers"])),
        get_search_space("SkipTransformer"),
    ]),
    get_search_space("Passthrough")
])
final_transformers_layer.generate().export_pipeline()
final_transformers_layer.generate().export_pipeline()
FeatureUnion(transformer_list=[('featureunion', FeatureUnion(transformer_list=[('quantiletransformer', QuantileTransformer(n_quantiles=842))])), ('passthrough', Passthrough())])
inner_estimators_layer = UnionPipeline([
    ChoicePipeline([
        DynamicUnionPipeline(wrapped_estimators, max_estimators=4),
        get_search_space("SkipTransformer"),
    ]),
    get_search_space("Passthrough")
])
inner_estimators_layer.generate().export_pipeline()
FeatureUnion(transformer_list=[('featureunion', FeatureUnion(transformer_list=[('estimatortransformer-1', EstimatorTransformer(estimator=LogisticRegression(C=3553.613707181859, max_iter=1000, n_jobs=1, solver='saga'))), ('estimatortransformer-2', EstimatorTransformer(estimator=GaussianNB())), ('estimatortransformer-3', EstimatorTransformer(estimator=MultinomialNB(alpha=0.0128552259108, fit_prior=False)))])), ('passthrough', Passthrough())])
final_linear_pipeline = SequentialPipeline([
get_search_space("scalers"),
final_transformers_layer,
inner_estimators_layer,
get_search_space("classifiers"),
])
final_linear_pipeline.generate().export_pipeline()
Pipeline(steps=[('normalizer', Normalizer(norm='max')), ('featureunion-1', FeatureUnion(transformer_list=[('featureunion', FeatureUnion(transformer_list=[('columnonehotencoder', ColumnOneHotEncoder())])), ('passthrough', Passthrough())])), ('featureunion-2', FeatureUnion(transformer_list=[('skiptransformer', SkipTransformer()), ('passthrough', Passthrough())])), ('baggingclassifier', BaggingClassifier(bootstrap_features=True, max_features=0.6083887402217, max_samples=0.440010144908, n_estimators=24, n_jobs=1, oob_score=True))])
Template Search Spaces¶
As mentioned in Tutorial 1, TPOT has several built-in search spaces. Here is the same table:
String | Description |
---|---|
linear | A linear pipeline with the structure of "Selector->(transformers+Passthrough)->(classifiers/regressors+Passthrough)->final classifier/regressor." For both the transformer and inner estimator layers, TPOT may choose one or more transformers/classifiers, or it may choose none. The inner classifier/regressor layer is optional. |
linear-light | Same search space as linear, but without the inner classifier/regressor layer and with a reduced set of faster running estimators. |
graph | TPOT will optimize a pipeline in the shape of a directed acyclic graph. The nodes of the graph can include selectors, scalers, transformers, or classifiers/regressors (inner classifiers/regressors can optionally be not included). This will return a custom GraphPipeline rather than an sklearn Pipeline. More details in Tutorial 6. |
graph-light | Same as graph search space, but without the inner classifier/regressors and with a reduced set of faster running estimators. |
mdr | TPOT will search over a series of feature selectors and Multifactor Dimensionality Reduction models to find a series of operators that maximize prediction accuracy. The TPOT MDR configuration is specialized for genome-wide association studies (GWAS), and is described in detail online here. |
Rather than create your own search space, you can simply pass one of these strings into the search_space param. Alternatively, you can call tpot2.config.template_search_spaces.get_template_search_spaces directly, which offers a few more customizable options for each template, including cross_val_predict_cv and whether or not stacked classifiers/regressors are allowed. Or you can copy the code and customize it manually!
`tpot2.config.template_search_spaces.get_template_search_spaces`
Returns a search space which can be optimized by TPOT.
Parameters
----------
search_space: str or SearchSpace
The default search space to use. If a string, it should be one of the following:
- 'linear': A search space for linear pipelines
- 'linear-light': A search space for linear pipelines with a smaller, faster search space
- 'graph': A search space for graph pipelines
- 'graph-light': A search space for graph pipelines with a smaller, faster search space
- 'mdr': A search space for MDR pipelines
If a SearchSpace object, it should be a valid search space object for TPOT.
classification: bool, default=True
Whether the problem is a classification problem or a regression problem.
inner_predictors: bool, default=None
Whether to include additional classifiers/regressors before the final classifier/regressor (allowing for ensembles).
Defaults to False for 'linear-light' and 'graph-light' search spaces, and True otherwise. (Not used for 'mdr' search space)
cross_val_predict_cv: int, default=None
The number of folds to use for cross_val_predict.
Defaults to 0 for 'linear-light' and 'graph-light' search spaces, and 5 otherwise. (Not used for 'mdr' search space)
get_search_space_params: dict
Additional parameters to pass to the get_search_space function.
linear_search_space = tpot2.config.template_search_spaces.get_template_search_spaces("linear", inner_predictors=True, cross_val_predict_cv=5)
linear_search_space.generate().export_pipeline()
Pipeline(steps=[('passthrough', Passthrough()), ('variancethreshold', VarianceThreshold(threshold=0.0014368451974)), ('featureunion-1', FeatureUnion(transformer_list=[('featureunion', FeatureUnion(transformer_list=[('powertransformer', PowerTransformer()), ('nystroem', Nystroem(gamma=0.8842695866347, kernel='sigmoid', n_components=7))])), ('passthrough', Passth... FeatureUnion(transformer_list=[('featureunion', FeatureUnion(transformer_list=[('estimatortransformer', EstimatorTransformer(cross_val_predict_cv=5, estimator=BaggingClassifier(bootstrap=False, max_features=0.2031842311627, max_samples=0.4743985327407, n_estimators=89, n_jobs=1)))])), ('passthrough', Passthrough())])), ('bernoullinb', BernoulliNB(alpha=4.2777686142181))])
linear_search_space = tpot2.config.template_search_spaces.get_template_search_spaces("linear", inner_predictors=True, cross_val_predict_cv=5)
linear_est = tpot2.TPOTEstimator(
search_space = linear_search_space,
scorers=['roc_auc_ovr',tpot2.objectives.complexity_scorer],
scorers_weights=[1,-1],
classification=True,
verbose=1,
)
#alternatively, you can pass the template string directly into the search_space parameter
linear_est = tpot2.TPOTEstimator(
search_space = "linear",
scorers=['roc_auc_ovr',tpot2.objectives.complexity_scorer],
scorers_weights=[1,-1],
n_jobs=32,
classification=True,
verbose=1,
)
Optimize Search Space with TPOTEstimator¶
Once you have constructed a search space, you can use TPOTEstimator to optimize a pipeline within that space. Simply pass that search space into the search_space parameter. Here is a cell where you can select different search spaces that we created in this tutorial.
all_search_spaces ={
"classifiers_only" : classifier_choice,
"stc_pipeline" : stc_pipeline,
"stc_pipeline2": stc_pipeline2,
"stc_pipeline3": stc_pipeline3,
"stc_pipeline4": stc_pipeline4,
"final_linear_pipeline": final_linear_pipeline,
"graph_pipeline": graph_search_space,
}
X, y = sklearn.datasets.load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, test_size=0.5)
selected_search_space = all_search_spaces["stc_pipeline"] #change this to select a different search space
est = tpot2.TPOTEstimator(
scorers=["roc_auc_ovr", tpot2.objectives.complexity_scorer],
scorers_weights=[1.0, -1.0],
classification = True,
cv = 5,
search_space = selected_search_space,
max_time_mins=10,
max_eval_time_mins = 10,
early_stop = 2,
verbose = 2,
n_jobs=4,
)
est.fit(X_train, y_train)
Generation: : 8it [01:44, 13.07s/it]
/home/perib/miniconda3/envs/myenv/lib/python3.10/site-packages/sklearn/preprocessing/_data.py:2785: UserWarning: n_quantiles (911) is greater than the total number of samples (284). n_quantiles is set to n_samples.
TPOTEstimator(classification=True, cv=5, early_stop=2, max_time_mins=10, n_jobs=4, scorers=['roc_auc_ovr', <function complexity_scorer at 0x78eb3afa4160>], scorers_weights=[1.0, -1.0], search_space=<tpot2.search_spaces.pipelines.sequential.SequentialPipeline object at 0x78eb39022d10>, verbose=2)
# score the model
auroc_scorer = sklearn.metrics.get_scorer("roc_auc")
auroc_score = auroc_scorer(est, X_test, y_test)
print("auroc score", auroc_score)
auroc score 0.9899335933382524
#plot the best pipeline
if isinstance(est.fitted_pipeline_, tpot2.GraphPipeline):
est.fitted_pipeline_.plot()
est.fitted_pipeline_
Pipeline(steps=[('selectfwe', SelectFwe(alpha=0.0336222333869)), ('quantiletransformer', QuantileTransformer(n_quantiles=911, output_distribution='normal')), ('quadraticdiscriminantanalysis', QuadraticDiscriminantAnalysis(reg_param=0.3209042101754))])
Transformer-only pipelines - imputation optimization example¶
Pipelines don't necessarily need to end in a classifier or regressor. Transformer-only pipelines are possible, as long as you have a custom objective function to match.
import sklearn
import sklearn.datasets
import numpy as np
import tpot2
#in practice, cross validation is likely better, but this simple example is fine for demonstration purposes
def rmse_objective(est, X, missing_add=.2, rng=1, fitted=False):
    rng = np.random.default_rng(rng)
    X_missing = X.copy()
    #randomly mask a fraction of the entries
    missing_idx = rng.random(X.shape) < missing_add
    X_missing[missing_idx] = np.nan
    if not fitted:
        est.fit(X_missing)
    X_filled = est.transform(X_missing)
    #RMSE between the imputed values and the true values at the masked entries
    return np.sqrt(np.mean((X_filled[missing_idx] - X[missing_idx])**2))
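As the comment above notes, cross validation would likely be more robust than a single fit/transform pass. A hedged sketch of what that could look like, fitting the imputer on corrupted training rows and scoring imputation error on held-out rows (rmse_objective_kfold is a hypothetical helper for illustration, not part of TPOT):
from sklearn.model_selection import KFold

def rmse_objective_kfold(est, X, missing_add=.2, n_splits=3, rng=1):
    rng = np.random.default_rng(rng)
    scores = []
    for train_idx, test_idx in KFold(n_splits=n_splits).split(X):
        X_train, X_test = X[train_idx].copy(), X[test_idx].copy()
        train_mask = rng.random(X_train.shape) < missing_add
        test_mask = rng.random(X_test.shape) < missing_add
        X_train[train_mask] = np.nan
        X_test_missing = X_test.copy()
        X_test_missing[test_mask] = np.nan
        est.fit(X_train) #fit the imputer only on (corrupted) training rows
        X_filled = est.transform(X_test_missing)
        scores.append(np.sqrt(np.mean((X_filled[test_mask] - X_test[test_mask])**2)))
    return np.mean(scores)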
from sklearn.impute import SimpleImputer
X, y = sklearn.datasets.load_diabetes(return_X_y=True)
imp = SimpleImputer(strategy="mean")
rmse_objective(imp, X)
0.04690299241236334
import tpot2.search_spaces
from ConfigSpace import ConfigurationSpace, Integer, Float, Categorical, Normal
from sklearn.experimental import enable_iterative_imputer #required before accessing sklearn.impute.IterativeImputer
import sklearn.impute
#set up an imputation search space that includes simple imputer, knn imputer, and iterative imputer (with an optimized ExtraTreesRegressor)
simple_imputer = tpot2.config.get_search_space("SimpleImputer")
knn_imputer = tpot2.config.get_search_space("KNNImputer")
space = ConfigurationSpace({
    'initial_strategy' : Categorical('initial_strategy', ['mean', 'median', 'most_frequent', 'constant']),
    'n_nearest_features' : Integer('n_nearest_features', bounds=(1, X.shape[1])),
    'imputation_order' : Categorical('imputation_order', ['ascending', 'descending', 'roman', 'arabic', 'random']),
})
# This optimizes both the iterative imputer parameters and the ExtraTreesRegressor parameters
iterative_imputer_sp = tpot2.search_spaces.pipelines.WrapperPipeline(
method = sklearn.impute.IterativeImputer,
space = space,
estimator_search_space = tpot2.config.get_search_space("ExtraTreesRegressor"),
)
#this is equivalent to
# iterative_imputer_sp = tpot2.config.get_search_space("IterativeImputer_learned_estimators")
imputation_search_space = tpot2.search_spaces.pipelines.ChoicePipeline(
search_spaces = [simple_imputer, knn_imputer, iterative_imputer_sp],
)
imputation_search_space.generate().export_pipeline()
KNNImputer(n_neighbors=99, weights='distance')
from functools import partial
final_objective = partial(rmse_objective, X=X, missing_add=.2)
est = tpot2.TPOTEstimator(
scorers = [],
scorers_weights = [],
other_objective_functions = [final_objective],
other_objective_functions_weights = [-1],
objective_function_names = ["rmse"],
classification = True,
search_space = imputation_search_space,
max_time_mins=10,
max_eval_time_mins = 60*5,
verbose = 3,
early_stop = 2,
n_jobs=20,
)
est.fit(X, y=y)
/home/perib/Projects/common/Projects/TPOT_Dev/tpot2/tpot2/tpot_estimator/estimator.py:504: UserWarning: Labels are not encoded as ints from 0 to N. For compatibility with some classifiers such as sklearn, TPOT has encoded y with the sklearn LabelEncoder. When using pipelines outside the main TPOT estimator class, you can encode the labels with est.label_encoder_
Generation: : 1it [00:24, 24.65s/it]
Generation: 1 Best rmse score: 0.034633208054417206
Generation: : 2it [00:47, 23.42s/it]
Generation: 2 Best rmse score: 0.034633208054417206
Generation: : 3it [01:12, 24.23s/it]
Generation: 3 Best rmse score: 0.03429318271103084
Generation: : 3it [01:40, 33.47s/it]
Generation: 4 Best rmse score: 0.03429318271103084 Early stop
TPOTEstimator(classification=True, early_stop=2, max_eval_time_mins=300, max_time_mins=10, n_jobs=20, objective_function_names=['rmse'], other_objective_functions=[functools.partial(<function rmse_objective at 0x78eb3890c700>, X=array([[ 0.03807591, 0.05068012, 0.06169621, ..., -0.00259226, 0.01990749, -0.01764613], [-0.00188202, -0.04464164, -0.05147406, ..., -0.0394933... -0.04688253, 0.01549073], [-0.04547248, -0.04464164, 0.03906215, ..., 0.02655962, 0.04452873, -0.02593034], [-0.04547248, -0.04464164, -0.0730303 , ..., -0.03949338, -0.00422151, 0.00306441]]), missing_add=0.2)], other_objective_functions_weights=[-1], scorers=[], scorers_weights=[], search_space=<tpot2.search_spaces.pipelines.choice.ChoicePipeline object at 0x78eb37c9f250>, verbose=3)
# score the model
rmse_score = final_objective(est, fitted=True)
print("final rmse score", rmse_score)
final rmse score 0.028453289651831883
est.fitted_pipeline_
IterativeImputer(estimator=ExtraTreesRegressor(max_features=0.7116178998798, min_samples_split=16), imputation_order='descending', initial_strategy='median', n_nearest_features=10)
Combined Search Space Example¶
from tpot2.search_spaces.pipelines import *
from tpot2.config import get_search_space
selectors = get_search_space(["selectors_classification", "Passthrough"])
estimators = get_search_space(["classifiers"])
# this allows us to wrap the classifiers in the EstimatorTransformer
# this is necessary so that classifiers can be used inside of sklearn pipelines
wrapped_estimators = WrapperPipeline(tpot2.builtin_modules.EstimatorTransformer, {}, estimators)
scalers = get_search_space(["scalers","Passthrough"])
transformers_layer = UnionPipeline([
    ChoicePipeline([
        DynamicUnionPipeline(get_search_space(["transformers"])),
        get_search_space("SkipTransformer"),
    ]),
    get_search_space("Passthrough")
])
inner_estimators_layer = UnionPipeline([
    ChoicePipeline([
        DynamicUnionPipeline(wrapped_estimators),
        get_search_space("SkipTransformer"),
    ]),
    get_search_space("Passthrough")
])
search_space = SequentialPipeline(search_spaces=[
scalers,
selectors,
transformers_layer,
inner_estimators_layer,
estimators,
])
est = tpot2.TPOTEstimator(
scorers = ["roc_auc"],
scorers_weights = [1],
classification = True,
cv = 5,
search_space = search_space,
max_time_mins=10,
max_eval_time_mins = 60*5,
verbose = 2,
n_jobs=20,
)
est.fit(X_train, y_train)
Generation: : 3it [01:30, 30.21s/it]
/home/perib/miniconda3/envs/myenv/lib/python3.10/site-packages/sklearn/discriminant_analysis.py:947: UserWarning: Variables are collinear
/home/perib/miniconda3/envs/myenv/lib/python3.10/site-packages/sklearn/neural_network/_multilayer_perceptron.py:690: ConvergenceWarning: Stochastic Optimizer: Maximum iterations (200) reached and the optimization hasn't converged yet.
TPOTEstimator(classification=True, cv=5, early_stop=2, max_eval_time_mins=300, max_time_mins=10, n_jobs=20, scorers=['roc_auc'], scorers_weights=[1], search_space=<tpot2.search_spaces.pipelines.sequential.SequentialPipeline object at 0x78eb2654aec0>, verbose=2)
est.fitted_pipeline_
Pipeline(steps=[('standardscaler', StandardScaler()), ('passthrough', Passthrough()), ('featureunion-1', FeatureUnion(transformer_list=[('featureunion', FeatureUnion(transformer_list=[('columnonehotencoder', ColumnOneHotEncoder())])), ('passthrough', Passthrough())])), ('featureunion-2', FeatureUnion(transformer_list=[('featureunion', FeatureUnion(transformer_... ('estimatortransformer-2', EstimatorTransformer(estimator=DecisionTreeClassifier(max_depth=16, max_features='log2', min_samples_leaf=9, min_samples_split=7)))])), ('passthrough', Passthrough())])), ('mlpclassifier', MLPClassifier(activation='tanh', alpha=0.0015036151556, hidden_layer_sizes=[435], learning_rate='adaptive', learning_rate_init=0.0002156053435, n_iter_no_change=32))])