Genetic Feature Selection nodes in TPOT2¶
TPOT2 can use evolutionary algorithms to optimize feature selection simultaneously with pipeline optimization. It includes two node search spaces with different feature selection strategies: FSSNode and GeneticFeatureSelectorNode.
FSSNode - (Feature Set Selector) This node is useful if you have a list of predefined feature sets you want to select from. Each FeatureSetSelector node will select a single group of features to be passed to the next step in the pipeline. Note that FSSNode does not create its own subset of features and does not mix/match multiple predefined feature sets.
GeneticFeatureSelectorNode - Whereas the FSSNode selects from a predefined list of feature sets, this node uses evolutionary algorithms to optimize a novel subset of features from scratch. This is useful when there is no predefined grouping of features.
This tutorial focuses on FSSNode. See Tutorial 5 for more information on GeneticFeatureSelectorNode.
It may also be beneficial to pair these search spaces with a secondary objective function to minimize complexity. That encourages TPOT to produce the simplest pipeline with the fewest features.
tpot2.objectives.number_of_nodes_objective - This can be passed via other_objective_functions; it counts the number of nodes in the pipeline.
tpot2.objectives.complexity_scorer - This is a scorer that tries to count the total number of learned parameters (number of coefficients, number of nodes in decision trees, etc.).
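For example, a minimal sketch (hypothetical settings, reusing the same TPOTEstimator arguments that appear later in this tutorial) of wiring in both kinds of complexity objective:
import tpot2
est = tpot2.TPOTEstimator(
    generations=5,
    scorers=["roc_auc_ovr", tpot2.objectives.complexity_scorer], # minimize estimated learned-parameter complexity
    scorers_weights=[1.0, -1.0],
    other_objective_functions=[tpot2.objectives.number_of_nodes_objective], # penalize pipeline size
    other_objective_functions_weights=[-1.0],
    classification=True,
)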
Feature Set Selector¶
The FeatureSetSelector is a subclass of sklearn.feature_selection.SelectorMixin that simply returns the manually specified columns. The parameter sel_subset specifies the names (or indexes) of the columns to select. The transform function then simply indexes and returns the selected columns. You can also optionally name the group with the name parameter, though this is only for record keeping and is not used by the class.
sel_subset : list or int
- If X is a DataFrame, items in the sel_subset list must correspond to column names.
- If X is a numpy array, items in the sel_subset list must correspond to column indexes.
- If an int, it is the index of a single column.
import tpot2
import pandas as pd
import numpy as np
# make a DataFrame with columns a,b,c,d,e,f
# from a 10-row numpy array where each row contains the values 0,1,2,3,4,5
data = np.repeat([np.arange(6)],10,0)
df = pd.DataFrame(data,columns=['a','b','c','d','e','f'])
fss = tpot2.builtin_modules.FeatureSetSelector(name='test',sel_subset=['a','b','c'])
print("original DataFrame")
print(df)
print("Transformed Data")
print(fss.fit_transform(df))
original DataFrame
   a  b  c  d  e  f
0  0  1  2  3  4  5
1  0  1  2  3  4  5
2  0  1  2  3  4  5
3  0  1  2  3  4  5
4  0  1  2  3  4  5
5  0  1  2  3  4  5
6  0  1  2  3  4  5
7  0  1  2  3  4  5
8  0  1  2  3  4  5
9  0  1  2  3  4  5
Transformed Data
[[0 1 2]
 [0 1 2]
 [0 1 2]
 [0 1 2]
 [0 1 2]
 [0 1 2]
 [0 1 2]
 [0 1 2]
 [0 1 2]
 [0 1 2]]
FSSNode¶
The FSSNode is a node search space that simply selects one feature set from a list of feature sets. It works identically to the EstimatorNode, but provides an easier interface for defining the feature sets.
Note that the FSSNode is only well defined when used as the first step in a pipeline. This is because downstream nodes receive transformed data in which the original indexes no longer correspond to the same columns.
The FSSNode takes a single parameter, subsets, which defines the groups of features. There are five ways of defining the subsets.
subsets : str, list, dict, int, or None, default=None
Sets the subsets that the FeatureSetSelector will select from if set as an option in one of the configuration dictionaries.
Features are defined by column names if using a pandas DataFrame, or by ints corresponding to column indexes if using numpy arrays.
- str : If a string, it is assumed to be a path to a csv file with the subsets.
The first column is assumed to be the name of the subset and the remaining columns are the features in the subset.
- list or np.ndarray : If a list or np.ndarray, it is assumed to be a list of subsets (i.e., a list of lists).
- dict : A dictionary where keys are the names of the subsets and the values are the lists of features.
- int : If an int, it is assumed to be the number of subsets to generate. Each subset will contain one feature.
- None : If None, each column will be treated as a subset. One column will be selected per subset.
Let's say you want three groups of features, each with three columns. The following definitions are equivalent:
str¶
subsets = 'simple_fss.csv'
# simple_fss.csv
group_one,1,2,3
group_two,4,5,6
group_three,7,8,9
dict¶
sel_subsets = { "group_one" : [1,2,3], "group_two" : [4,5,6], "group_three" : [7,8,9], }
list¶
subsets = [[1,2,3], [4,5,6], [7,8,9]]
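int and None¶
The int and None forms need no predefined groups. A brief illustration of the descriptions above:
subsets = 3 # int: generate three subsets, each containing a single feature
subsets = None # None: every column becomes its own single-feature subset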
Examples¶
For these examples, we create a dummy dataset where the first six columns are informative and the rest are uninformative.
import tpot2
import sklearn.datasets
import sklearn.metrics
import sklearn.model_selection
from sklearn.linear_model import LogisticRegression
import numpy as np
import pandas as pd
from tpot2.search_spaces.nodes import *
from tpot2.search_spaces.pipelines import *
from tpot2.config import get_search_space
X, y = sklearn.datasets.make_classification(n_samples=1000, n_features=6, n_informative=6, n_redundant=0, n_repeated=0, n_classes=2, n_clusters_per_class=2, weights=None, flip_y=0.01, class_sep=1.0, hypercube=True, shift=0.0, scale=1.0, shuffle=True, random_state=None)
X = np.hstack([X, np.random.rand(X.shape[0],6)]) # add six uninformative features
X = pd.DataFrame(X, columns=['a','b','c','d','e','f','g','h','i','j','k','l']) # a-f are informative; g-l are uninformative
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, train_size=0.75, test_size=0.25)
X.head()
| | a | b | c | d | e | f | g | h | i | j | k | l |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.988411 | -3.270714 | -1.816697 | 0.384124 | 1.258591 | -1.577232 | 0.101273 | 0.657975 | 0.770880 | 0.882366 | 0.637714 | 0.002812 |
| 1 | -0.531157 | -1.298541 | -2.630749 | 0.036662 | -2.097307 | -1.711751 | 0.894172 | 0.727579 | 0.211429 | 0.223319 | 0.496683 | 0.840040 |
| 2 | -0.896734 | -1.805453 | -2.736948 | -0.310169 | 1.802988 | -0.269441 | 0.765178 | 0.341713 | 0.847770 | 0.696190 | 0.824104 | 0.297523 |
| 3 | 1.637719 | -0.930537 | -0.229303 | 0.198907 | 1.184137 | -0.411545 | 0.870378 | 0.811312 | 0.142528 | 0.707361 | 0.201967 | 0.867956 |
| 4 | -1.709777 | -2.701615 | 0.297434 | -0.909832 | 1.436884 | 0.120985 | 0.866854 | 0.352461 | 0.690270 | 0.172950 | 0.056518 | 0.806867 |
Let's say that, either based on prior knowledge or interest, we know the features can be grouped as follows:
subsets = { "group_one" : ['a','b','c',],
"group_two" : ['d','e','f'],
"group_three" : ['g','h','i'],
"group_four" : ['j','k','l'],
}
We can create an FSSNode that will select from these subsets. Each FSSNode in a pipeline selects exactly one subset.
fss_search_space = FSSNode(subsets=subsets)
If we randomly sample from this search space, we get a single selector that selects one of the predefined sets. In this case, it selects group_two, which includes ['d', 'e', 'f']. (A random seed was set in the generate function so that the same group is selected when rerunning the notebook.)
fss_selector = fss_search_space.generate(rng=1).export_pipeline()
fss_selector
FeatureSetSelector(name='group_two', sel_subset=['d', 'e', 'f'])
fss_selector.set_output(transform="pandas") # by default, sklearn selectors return numpy arrays; this makes them return pandas DataFrames
fss_selector.fit(X_train)
fss_selector.transform(X_train)
| | d | e | f |
|---|---|---|---|
| 28 | -2.393671 | 2.653494 | 1.336840 |
| 540 | -1.598037 | -2.639941 | -1.787062 |
| 980 | -1.562249 | 1.573867 | -0.135207 |
| 812 | 0.084835 | 1.809188 | -1.525609 |
| 117 | 0.647414 | 1.437139 | 1.873279 |
| ... | ... | ... | ... |
| 630 | 0.102721 | 0.463829 | -0.220689 |
| 963 | -0.530709 | 0.353686 | 0.621369 |
| 943 | 3.850193 | 0.948248 | -2.042764 |
| 930 | 1.051634 | 1.240570 | -1.477092 |
| 116 | -0.126476 | -1.599799 | -0.610169 |
750 rows × 3 columns
Under the hood, mutation randomly selects another feature set, and crossover swaps the feature sets selected by two individuals.
ind1 = fss_search_space.generate(rng=1)
ind1.export_pipeline()
FeatureSetSelector(name='group_two', sel_subset=['d', 'e', 'f'])
ind1.mutate()
ind1.export_pipeline()
FeatureSetSelector(name='group_three', sel_subset=['g', 'h', 'i'])
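Crossover can be demonstrated the same way. A quick sketch (this assumes the individual exposes a crossover method taking another individual from the same search space, mirroring the mutate call above):
ind2 = fss_search_space.generate(rng=2)
ind1.crossover(ind2) # swap the selected feature sets between ind1 and ind2
ind1.export_pipeline()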
We can now use this when defining our pipelines. For this first example, we will construct a simple linear pipeline where the first step is a feature set selector and the second is a classifier.
classification_search_space = get_search_space(["RandomForestClassifier"])
fss_and_classifier_search_space = SequentialPipeline([fss_search_space, classification_search_space])
est = tpot2.TPOTEstimator(generations=5,
scorers=["roc_auc_ovr", tpot2.objectives.complexity_scorer],
scorers_weights=[1.0, -1.0],
n_jobs=32,
classification=True,
search_space = fss_and_classifier_search_space,
verbose=1,
)
scorer = sklearn.metrics.get_scorer('roc_auc_ovr')
est.fit(X_train, y_train)
print(scorer(est, X_test, y_test))
Generation: 100%|██████████| 5/5 [00:30<00:00, 6.11s/it]
0.90263107355483
est.fitted_pipeline_
Pipeline(steps=[('featuresetselector', FeatureSetSelector(name='group_one', sel_subset=['a', 'b', 'c'])), ('randomforestclassifier', RandomForestClassifier(criterion='entropy', max_features=0.4070021568844, min_samples_leaf=4, min_samples_split=3, n_estimators=128))])
With this setup, TPOT is able to identify one of the informative subsets, but the performance is not optimal. In this case we happen to know that multiple feature sets are required. If we want to include multiple feature sets in our pipelines, we will have to modify our search space. There are three options:
- UnionPipeline - This allows you to have a fixed number of feature sets selected. If you use a UnionPipeline with two FSSNodes, you will always select exactly two feature sets, which are simply concatenated together.
- DynamicUnionPipeline - This space allows multiple FSSNodes to be selected. Unlike UnionPipeline, you don't have to specify the number of selected sets; TPOT will identify the optimal number. Additionally, with DynamicUnionPipeline, the same feature set cannot be selected twice. Note that while DynamicUnionPipeline can select multiple feature sets, it never mixes two feature sets together.
- GraphSearchPipeline - When set as the leaf_search_space, GraphSearchPipeline can also select multiple FSSNodes, which act as inputs to the rest of the pipeline.
UnionPipeline + FSSNode example¶
union_fss_space = UnionPipeline([fss_search_space, fss_search_space])
# this union search space will always select exactly two feature sets from fss_search_space
selector1 = union_fss_space.generate(rng=1).export_pipeline()
selector1
FeatureUnion(transformer_list=[('featuresetselector-1', FeatureSetSelector(name='group_two', sel_subset=['d', 'e', 'f'])), ('featuresetselector-2', FeatureSetSelector(name='group_three', sel_subset=['g', 'h', 'i']))])
selector1.set_output(transform="pandas")
selector1.fit(X_train)
selector1.transform(X_train)
| | d | e | f | g | h | i |
|---|---|---|---|---|---|---|
| 28 | -2.393671 | 2.653494 | 1.336840 | 0.671229 | 0.431712 | 0.090788 |
| 540 | -1.598037 | -2.639941 | -1.787062 | 0.520648 | 0.436337 | 0.576560 |
| 980 | -1.562249 | 1.573867 | -0.135207 | 0.323676 | 0.052558 | 0.892457 |
| 812 | 0.084835 | 1.809188 | -1.525609 | 0.777859 | 0.327459 | 0.626609 |
| 117 | 0.647414 | 1.437139 | 1.873279 | 0.383676 | 0.448043 | 0.908426 |
| ... | ... | ... | ... | ... | ... | ... |
| 630 | 0.102721 | 0.463829 | -0.220689 | 0.155922 | 0.057284 | 0.581789 |
| 963 | -0.530709 | 0.353686 | 0.621369 | 0.701410 | 0.205080 | 0.189494 |
| 943 | 3.850193 | 0.948248 | -2.042764 | 0.737312 | 0.082513 | 0.886070 |
| 930 | 1.051634 | 1.240570 | -1.477092 | 0.207093 | 0.349121 | 0.027916 |
| 116 | -0.126476 | -1.599799 | -0.610169 | 0.185323 | 0.024521 | 0.685559 |
750 rows × 6 columns
DynamicUnionPipeline + FSSNode example¶
The dynamic union pipeline may select a variable number of feature sets.
dynamic_fss_space = DynamicUnionPipeline(fss_search_space)
dynamic_fss_space.generate(rng=1).export_pipeline()
FeatureUnion(transformer_list=[('featuresetselector', FeatureSetSelector(name='group_three', sel_subset=['g', 'h', 'i']))])
dynamic_fss_space.generate(rng=3).export_pipeline()
FeatureUnion(transformer_list=[('featuresetselector-1', FeatureSetSelector(name='group_one', sel_subset=['a', 'b', 'c'])), ('featuresetselector-2', FeatureSetSelector(name='group_four', sel_subset=['j', 'k', 'l']))])
GraphSearchPipeline + FSSNode example¶
FSSNodes must be set as the leaf search space as they act as the inputs to the pipeline.
Here is an example pipeline from this search space that utilizes two feature sets.
graph_search_space = tpot2.search_spaces.pipelines.GraphSearchPipeline(
leaf_search_space = fss_search_space,
inner_search_space = tpot2.config.get_search_space(["transformers"]),
root_search_space= tpot2.config.get_search_space(["KNeighborsClassifier", "LogisticRegression", "DecisionTreeClassifier"]),
max_size = 10,
)
graph_search_space.generate(rng=4).export_pipeline().plot()
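The exported object can also be fit directly. A quick sketch (assuming the exported GraphPipeline follows the standard sklearn fit/score interface, like the estimators above):
graph_pipeline = graph_search_space.generate(rng=4).export_pipeline()
graph_pipeline.fit(X_train, y_train)
print(scorer(graph_pipeline, X_test, y_test)) # scorer was defined earlier in this tutorial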
Optimize with TPOT¶
For this example, we will optimize the DynamicUnionPipeline search space.
import tpot2
import sklearn.datasets
from sklearn.linear_model import LogisticRegression
import numpy as np
final_classification_search_space = SequentialPipeline([dynamic_fss_space, classification_search_space])
est = tpot2.TPOTEstimator(generations=5,
scorers=["roc_auc_ovr", tpot2.objectives.complexity_scorer],
scorers_weights=[1.0, -1.0],
n_jobs=32,
classification=True,
search_space = final_classification_search_space,
verbose=1,
)
scorer = sklearn.metrics.get_scorer('roc_auc_ovr')
est.fit(X_train, y_train)
print(scorer(est, X_test, y_test))
Generation: 100%|██████████| 5/5 [00:34<00:00, 6.88s/it]
0.9482747583381345
We can see that this pipeline performed slightly better and correctly identified group_one and group_two as the feature sets containing the informative features.
est.fitted_pipeline_
Pipeline(steps=[('featureunion', FeatureUnion(transformer_list=[('featuresetselector-1', FeatureSetSelector(name='group_one', sel_subset=['a', 'b', 'c'])), ('featuresetselector-2', FeatureSetSelector(name='group_two', sel_subset=['d', 'e', 'f']))])), ('randomforestclassifier', RandomForestClassifier(bootstrap=False, class_weight='balanced', max_features=0.4909664847192, min_samples_leaf=2, min_samples_split=4, n_estimators=128))])
Combining with existing search spaces¶
As with all search spaces, FSSNode can be combined with any other search space.
You can also pair this with the existing prebuilt templates, for example:
linear_search_space = tpot2.config.template_search_spaces.get_template_search_spaces("linear", classification=True)
fss_and_linear_search_space = SequentialPipeline([fss_search_space, linear_search_space])
# est = tpot2.TPOTEstimator(
# population_size=32,
# generations=10,
# scorers=["roc_auc_ovr", tpot2.objectives.complexity_scorer],
# scorers_weights=[1.0, -1.0],
# other_objective_functions=[number_of_selected_features],
# other_objective_functions_weights = [-1],
# objective_function_names = ["Number of selected features"],
# n_jobs=32,
# classification=True,
# search_space = fss_and_linear_search_space,
# verbose=2,
# )
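Note that the commented block above references number_of_selected_features, which is not defined in this notebook. One plausible (hypothetical) definition, assuming custom objective functions receive the exported sklearn pipeline whose first step is the FeatureSetSelector:
def number_of_selected_features(pipeline):
    # hypothetical custom objective: count the columns kept by the
    # FeatureSetSelector in the first step of the pipeline
    return len(pipeline.steps[0][1].sel_subset)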
fss_and_linear_search_space.generate(rng=1).export_pipeline()
Pipeline(steps=[('featuresetselector', FeatureSetSelector(name='group_two', sel_subset=[3, 4, 5])), ('pipeline', Pipeline(steps=[('maxabsscaler', MaxAbsScaler()), ('rfe', RFE(estimator=ExtraTreesClassifier(max_features=0.0390676831531, min_samples_leaf=8, min_samples_split=14, n_jobs=1), step=0.753983388654)), ('featureunion-1', FeatureUnion(transformer_list=[('f... FeatureUnion(transformer_list=[('powertransformer', PowerTransformer()), ('pca', PCA(n_components=0.9286371732844))])), ('passthrough', Passthrough())])), ('featureunion-2', FeatureUnion(transformer_list=[('skiptransformer', SkipTransformer()), ('passthrough', Passthrough())])), ('kneighborsclassifier', KNeighborsClassifier(n_jobs=1, n_neighbors=21, weights='distance'))]))])
FeatureSetSelector(name='group_two', sel_subset=[3, 4, 5])
Pipeline(steps=[('maxabsscaler', MaxAbsScaler()), ('rfe', RFE(estimator=ExtraTreesClassifier(max_features=0.0390676831531, min_samples_leaf=8, min_samples_split=14, n_jobs=1), step=0.753983388654)), ('featureunion-1', FeatureUnion(transformer_list=[('featureunion', FeatureUnion(transformer_list=[('powertransformer', PowerTransformer()), ('pca', PCA(n_components=0.9286371732844))])), ('passthrough', Passthrough())])), ('featureunion-2', FeatureUnion(transformer_list=[('skiptransformer', SkipTransformer()), ('passthrough', Passthrough())])), ('kneighborsclassifier', KNeighborsClassifier(n_jobs=1, n_neighbors=21, weights='distance'))])
Getting Fancy¶
If you want to get fancy, you can combine more search spaces in order to set up unique preprocessing pipelines per feature set. Here's an example:
dynamic_transformers = DynamicUnionPipeline(get_search_space("all_transformers"), max_estimators=4)
dynamic_transformers_with_passthrough = tpot2.search_spaces.pipelines.UnionPipeline([
dynamic_transformers,
tpot2.config.get_search_space("Passthrough")],
)
multi_step_engineering = DynamicLinearPipeline(dynamic_transformers_with_passthrough, max_length=4)
fss_engineering_search_space = SequentialPipeline([fss_search_space, multi_step_engineering])
union_fss_engineering_search_space = DynamicUnionPipeline(fss_engineering_search_space)
final_fancy_search_space = SequentialPipeline([union_fss_engineering_search_space, classification_search_space])
final_fancy_search_space.generate(rng=3).export_pipeline()
Pipeline(steps=[('featureunion', FeatureUnion(transformer_list=[('pipeline-1', Pipeline(steps=[('featuresetselector', FeatureSetSelector(name='group_one', sel_subset=['a', 'b', 'c'])), ('pipeline', Pipeline(steps=[('featureunion', FeatureUnion(transformer_list=[('featureunion', FeatureUnion(transformer_list=[('pca', PCA(n_components=0.93113403057))])), ('passt... FeatureUnion(transformer_list=[('featureunion', FeatureUnion(transformer_list=[('quantiletransformer', QuantileTransformer(n_quantiles=87)), ('columnonehotencoder', ColumnOneHotEncoder())])), ('passthrough', Passthrough())]))]))]))])), ('randomforestclassifier', RandomForestClassifier(class_weight='balanced', criterion='entropy', max_features=0.021545996678, min_samples_leaf=11, n_estimators=128))])
FeatureSetSelector(name='group_one', sel_subset=['a', 'b', 'c'])
Pipeline(steps=[('featureunion', FeatureUnion(transformer_list=[('featureunion', FeatureUnion(transformer_list=[('pca', PCA(n_components=0.93113403057))])), ('passthrough', Passthrough())]))])
FeatureSetSelector(name='group_four', sel_subset=['j', 'k', 'l'])
Pipeline(steps=[('featureunion-1', FeatureUnion(transformer_list=[('featureunion', FeatureUnion(transformer_list=[('binarizer', Binarizer(threshold=0.5396272782675))])), ('passthrough', Passthrough())])), ('featureunion-2', FeatureUnion(transformer_list=[('featureunion', FeatureUnion(transformer_list=[('quantiletransformer', QuantileTransformer(n_quantiles=87)), ('columnonehotencoder', ColumnOneHotEncoder())])), ('passthrough', Passthrough())]))])
RandomForestClassifier(class_weight='balanced', criterion='entropy', max_features=0.021545996678, min_samples_leaf=11, n_estimators=128)
Other examples¶
dictionary¶
import tpot2
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
import sklearn
subsets = { "group_one" : ['a','b','c'],
"group_two" : ['d','e','f'],
"group_three" : ['g','h','i'],
}
fss_search_space = tpot2.search_spaces.nodes.FSSNode(subsets=subsets)
selector = fss_search_space.generate(rng=1).export_pipeline()
selector.set_output(transform="pandas")
selector.fit(X_train)
selector.transform(X_train)
| | d | e | f |
|---|---|---|---|
| 28 | -2.393671 | 2.653494 | 1.336840 |
| 540 | -1.598037 | -2.639941 | -1.787062 |
| 980 | -1.562249 | 1.573867 | -0.135207 |
| 812 | 0.084835 | 1.809188 | -1.525609 |
| 117 | 0.647414 | 1.437139 | 1.873279 |
| ... | ... | ... | ... |
| 630 | 0.102721 | 0.463829 | -0.220689 |
| 963 | -0.530709 | 0.353686 | 0.621369 |
| 943 | 3.850193 | 0.948248 | -2.042764 |
| 930 | 1.051634 | 1.240570 | -1.477092 |
| 116 | -0.126476 | -1.599799 | -0.610169 |
750 rows × 3 columns
list¶
import tpot2
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
import sklearn
subsets = [['a','b','c'],['d','e','f'],['g','h','i']]
fss_search_space = tpot2.search_spaces.nodes.FSSNode(subsets=subsets)
selector = fss_search_space.generate(rng=1).export_pipeline()
selector.set_output(transform="pandas")
selector.fit(X_train)
selector.transform(X_train)
| | d | e | f |
|---|---|---|---|
| 28 | -2.393671 | 2.653494 | 1.336840 |
| 540 | -1.598037 | -2.639941 | -1.787062 |
| 980 | -1.562249 | 1.573867 | -0.135207 |
| 812 | 0.084835 | 1.809188 | -1.525609 |
| 117 | 0.647414 | 1.437139 | 1.873279 |
| ... | ... | ... | ... |
| 630 | 0.102721 | 0.463829 | -0.220689 |
| 963 | -0.530709 | 0.353686 | 0.621369 |
| 943 | 3.850193 | 0.948248 | -2.042764 |
| 930 | 1.051634 | 1.240570 | -1.477092 |
| 116 | -0.126476 | -1.599799 | -0.610169 |
750 rows × 3 columns
csv file¶
note: watch for spaces in the csv file!
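To make this example self-contained, you could write the file first. A minimal sketch matching the layout shown in the cell below:
# create simple_fss.csv: subset name first, then its columns, with no stray spaces
with open('simple_fss.csv', 'w') as f:
    f.write('one,a,b,c\n')
    f.write('two,d,e,f\n')
    f.write('three,g,h,i\n')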
import tpot2
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
import sklearn
subsets = 'simple_fss.csv'
'''
# simple_fss.csv
one,a,b,c
two,d,e,f
three,g,h,i
'''
fss_search_space = tpot2.search_spaces.nodes.FSSNode(subsets=subsets)
selector = fss_search_space.generate(rng=1).export_pipeline()
selector.set_output(transform="pandas")
selector.fit(X_train)
selector.transform(X_train)
| | d | e | f |
|---|---|---|---|
| 28 | -2.393671 | 2.653494 | 1.336840 |
| 540 | -1.598037 | -2.639941 | -1.787062 |
| 980 | -1.562249 | 1.573867 | -0.135207 |
| 812 | 0.084835 | 1.809188 | -1.525609 |
| 117 | 0.647414 | 1.437139 | 1.873279 |
| ... | ... | ... | ... |
| 630 | 0.102721 | 0.463829 | -0.220689 |
| 963 | -0.530709 | 0.353686 | 0.621369 |
| 943 | 3.850193 | 0.948248 | -2.042764 |
| 930 | 1.051634 | 1.240570 | -1.477092 |
| 116 | -0.126476 | -1.599799 | -0.610169 |
750 rows × 3 columns
All of the above works the same when using numpy data, but the column names are replaced with integer indexes.
import tpot2
import sklearn.datasets
import sklearn.model_selection
from sklearn.linear_model import LogisticRegression
import numpy as np
import pandas as pd
n_features = 6
X, y = sklearn.datasets.make_classification(n_samples=1000, n_features=n_features, n_informative=6, n_redundant=0, n_repeated=0, n_classes=2, n_clusters_per_class=2, weights=None, flip_y=0.01, class_sep=1.0, hypercube=True, shift=0.0, scale=1.0, shuffle=True, random_state=None)
X = np.hstack([X, np.random.rand(X.shape[0],3)]) #add three uninformative features
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, train_size=0.75, test_size=0.25)
print(X)
[[-0.43714166 -0.50887207 -3.17945595 ...  0.07671291  0.47607558  0.89683945]
 [-2.83836404 -0.22115893 -0.07445108 ...  0.03073931  0.2766683   0.36285899]
 [-2.28029617 -1.38851427 -3.22134569 ...  0.92830528  0.59176052  0.18041296]
 ...
 [ 0.61359823 -0.41893724 -2.9625971  ...  0.75602013  0.52478388  0.69249969]
 [-2.27709727  2.99680411  0.70411587 ...  0.02910316  0.93519319  0.0034257 ]
 [-1.59654364 -0.53352175 -0.50919438 ...  0.63719765  0.47591644  0.84288743]]
import tpot2
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
import sklearn
subsets = { "group_one" : [0,1,2],
"group_two" : [3,4,5],
"group_three" : [6,7,8],
}
fss_search_space = tpot2.search_spaces.nodes.FSSNode(subsets=subsets)
selector = fss_search_space.generate(rng=1).export_pipeline()
selector.fit(X_train)
selector.transform(X_train)
array([[-2.15063999, -0.84591563, -0.66736542],
       [-0.95324351,  2.00496434,  1.22398102],
       [-0.08542414, -0.26901573, -3.67530636],
       ...,
       [ 0.48872267, -0.87071824,  1.60102349],
       [-4.45746257, -2.41209776,  0.42331464],
       [-0.72541871, -0.02783289, -1.98627911]])