Genetic feature selection

This file is part of the TPOT library.

The current version of TPOT was developed at Cedars-Sinai by:
- Pedro Henrique Ribeiro (https://github.com/perib, https://www.linkedin.com/in/pedro-ribeiro/)
- Anil Saini (anil.saini@cshs.org)
- Jose Hernandez (jgh9094@gmail.com)
- Jay Moran (jay.moran@cshs.org)
- Nicholas Matsumoto (nicholas.matsumoto@cshs.org)
- Hyunjun Choi (hyunjun.choi@cshs.org)
- Gabriel Ketron (gabriel.ketron@cshs.org)
- Miguel E. Hernandez (miguel.e.hernandez@cshs.org)
- Jason Moore (moorejh28@gmail.com)

The original version of TPOT was primarily developed at the University of Pennsylvania by:
- Randal S. Olson (rso@randalolson.com)
- Weixuan Fu (weixuanf@upenn.edu)
- Daniel Angell (dpa34@drexel.edu)
- Jason Moore (moorejh28@gmail.com)
- and many more generous open-source contributors

TPOT is free software: you can redistribute it and/or modify it under the terms of the GNU Lesser General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

TPOT is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public License for more details.

You should have received a copy of the GNU Lesser General Public License along with TPOT. If not, see http://www.gnu.org/licenses/.

GeneticFeatureSelectorNode

Bases: SearchSpace

Source code in tpot/search_spaces/nodes/genetic_feature_selection.py
class GeneticFeatureSelectorNode(SearchSpace):
    def __init__(self,                     
                    n_features,
                    start_p=0.2,
                    mutation_rate = 0.1,
                    crossover_rate = 0.1,
                    mutation_rate_rate = 0, # These are still experimental but seem to help. The theory is that the search takes smaller steps as it gets closer to the optimal solution.
                    crossover_rate_rate = 0, # Otherwise, if mutation_rate is too small, the search takes forever, and if it's too large, it never converges.
                    ):
        """
        A node that generates a GeneticFeatureSelectorIndividual. Uses a genetic algorithm to select novel subsets of features.

        Parameters
        ----------
        n_features : int
            Number of features in the dataset.
        start_p : float
            Probability of selecting a given feature for the initial subset of features.
        mutation_rate : float
            Probability of adding/removing a feature from the subset of features.
        crossover_rate : float
            Probability of swapping a feature between two subsets of features.
        mutation_rate_rate : float
            Probability of changing the mutation rate. (experimental)
        crossover_rate_rate : float
            Probability of changing the crossover rate. (experimental)

        """

        self.n_features = n_features
        self.start_p = start_p
        self.mutation_rate = mutation_rate
        self.crossover_rate = crossover_rate
        self.mutation_rate_rate = mutation_rate_rate
        self.crossover_rate_rate = crossover_rate_rate


    def generate(self, rng=None) -> SklearnIndividual:
        return GeneticFeatureSelectorIndividual(   mask=self.n_features,
                                                    start_p=self.start_p,
                                                    mutation_rate=self.mutation_rate,
                                                    crossover_rate=self.crossover_rate,
                                                    mutation_rate_rate=self.mutation_rate_rate,
                                                    crossover_rate_rate=self.crossover_rate_rate,
                                                    rng=rng
                                                )
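The start_p parameter passed by generate() is described as the probability of selecting a given feature for the initial subset. A minimal sketch of what such an initialization could look like follows. It is not TPOT's implementation: the helper name initial_mask is hypothetical, it uses the standard-library random module instead of TPOT's rng handling, and forcing at least one selected feature is an assumption.

```python
import random


def initial_mask(n_features, start_p=0.2, rng=None):
    """Return a list of booleans; each feature is kept with probability start_p."""
    rng = random.Random() if rng is None else rng
    mask = [rng.random() < start_p for _ in range(n_features)]
    # Assumption: an empty subset is useless, so switch one random feature on.
    if not any(mask):
        mask[rng.randrange(n_features)] = True
    return mask


mask = initial_mask(n_features=10, start_p=0.2, rng=random.Random(0))
selected = [i for i, keep in enumerate(mask) if keep]  # indices of kept features
```

With start_p=0.2 and 10 features, about two features are selected on average; passing a seeded generator makes the draw reproducible.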

__init__(n_features, start_p=0.2, mutation_rate=0.1, crossover_rate=0.1, mutation_rate_rate=0, crossover_rate_rate=0)

A node that generates a GeneticFeatureSelectorIndividual. Uses a genetic algorithm to select novel subsets of features.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| n_features | int | Number of features in the dataset. | required |
| start_p | float | Probability of selecting a given feature for the initial subset of features. | 0.2 |
| mutation_rate | float | Probability of adding/removing a feature from the subset of features. | 0.1 |
| crossover_rate | float | Probability of swapping a feature between two subsets of features. | 0.1 |
| mutation_rate_rate | float | Probability of changing the mutation rate. (experimental) | 0 |
| crossover_rate_rate | float | Probability of changing the crossover rate. (experimental) | 0 |
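To make the mutation_rate and crossover_rate descriptions concrete, here is a minimal sketch of the two operators on boolean masks. It assumes each position is handled independently: a bit is flipped with probability mutation_rate, and a position is swapped between two parents with probability crossover_rate. The function names mutate and crossover are hypothetical, and the actual operators in GeneticFeatureSelectorIndividual may differ.

```python
import random


def mutate(mask, mutation_rate=0.1, rng=None):
    """Flip each bit (add or remove that feature) with probability mutation_rate."""
    rng = random.Random() if rng is None else rng
    return [(not bit) if rng.random() < mutation_rate else bit for bit in mask]


def crossover(mask_a, mask_b, crossover_rate=0.1, rng=None):
    """Swap each position between the two masks with probability crossover_rate."""
    rng = random.Random() if rng is None else rng
    child_a, child_b = list(mask_a), list(mask_b)
    for i in range(len(child_a)):
        if rng.random() < crossover_rate:
            child_a[i], child_b[i] = child_b[i], child_a[i]
    return child_a, child_b


parent_a = [True, False, True, False]
parent_b = [False, False, True, True]
mutated = mutate(parent_a, mutation_rate=0.1, rng=random.Random(0))
child_a, child_b = crossover(parent_a, parent_b, crossover_rate=0.1, rng=random.Random(0))
```

Both functions return new lists and leave their inputs unchanged, which keeps parents reusable across several offspring.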
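The experimental mutation_rate_rate and crossover_rate_rate parameters are described as the probability of changing the corresponding rate. One way to read this is as a self-adaptive step size: occasionally the rate itself is rescaled, so the search can take smaller or larger steps over time. The sketch below illustrates that idea only. The helper name adapt_rate, the scaling range (0.5 to 2.0), and the cap of 1.0 are all assumptions, not values taken from TPOT.

```python
import random


def adapt_rate(rate, rate_rate, rng=None, low=0.5, high=2.0, cap=1.0):
    """With probability rate_rate, rescale rate by a random factor in [low, high]."""
    rng = random.Random() if rng is None else rng
    if rng.random() < rate_rate:
        # Assumption: multiplicative update, capped so the rate stays a probability.
        rate = min(rate * rng.uniform(low, high), cap)
    return rate


unchanged = adapt_rate(0.1, rate_rate=0.0, rng=random.Random(0))
adapted = adapt_rate(0.1, rate_rate=1.0, rng=random.Random(0))
```

With the default rate_rate of 0, the rate never changes, which matches the defaults listed for both parameters.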

MaskSelector

Bases: SelectorMixin, BaseEstimator

Select predefined feature subsets.

Source code in tpot/search_spaces/nodes/genetic_feature_selection.py
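import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator
from sklearn.feature_selection import SelectorMixin

class MaskSelector(SelectorMixin, BaseEstimator):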
    """Select predefined feature subsets."""

    def __init__(self, mask, set_output_transform=None):
        self.mask = mask
        self.set_output_transform = set_output_transform
        if set_output_transform is not None:
            self.set_output(transform=set_output_transform)

    def fit(self, X, y=None):
        self.n_features_in_ = X.shape[1]
        if isinstance(X, pd.DataFrame):
            self.feature_names_in_ = X.columns
        #     self.set_output(transform="pandas")
        self.is_fitted_ = True #so sklearn knows it's fitted
        return self

    def __sklearn_tags__(self):
        tags = super().__sklearn_tags__()
        tags.input_tags.allow_nan = True
        tags.target_tags.required = False # formerly requires_y
        return tags

    def _get_support_mask(self):
        return np.array(self.mask)

    def get_feature_names_out(self, input_features=None):
        return self.feature_names_in_[self.get_support()]
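MaskSelector returns its stored mask from _get_support_mask, so transforming data keeps exactly the columns where the mask is true, and get_feature_names_out applies the same mask to the input column names. The following dependency-free sketch shows that boolean-mask semantics with plain lists standing in for arrays and DataFrames. The helper select_columns and the sample data are hypothetical and are not part of TPOT or scikit-learn.

```python
from itertools import compress


def select_columns(rows, mask):
    """Keep, in every row, only the entries whose mask position is True."""
    return [list(compress(row, mask)) for row in rows]


feature_names = ["age", "height", "weight", "income"]
mask = [True, False, False, True]
X = [
    [31, 170, 65, 40000],
    [45, 182, 80, 52000],
]

X_selected = select_columns(X, mask)            # columns 0 and 3 survive
names_out = list(compress(feature_names, mask))  # matching column names
```

Note that get_feature_names_out as listed reads feature_names_in_, which fit only sets when X is a pandas DataFrame.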