AI API

AI

class ai.ai.AI(rec_class=None, api_path=None, extra_payload={}, user='testuser', rec_score_file='rec_state.obj', verbose=True, warm_start=False, n_recs=1, datasets=False, use_knowledgebase=False, term_condition='n_recs', max_time=5)[source]

AI managing agent for Aliro.

Responsible for:
  • checking for user requests for recommendations,
  • checking for new results from experiments,
  • calling the recommender system to generate experiment recommendations,
  • posting the recommendations to the API,
  • handling communication with the API.

Parameters
  • rec_class – ai.BaseRecommender - recommender to use

  • api_path – string - path to the lab api server

  • extra_payload – dict - any additional payload that needs to be specified

  • user – string - test user

  • rec_score_file – file - pickled score file to keep persistent scores between sessions

  • verbose – Boolean

  • warm_start – Boolean - if true, attempt to load the ai state from the file provided by rec_score_file

  • n_recs – int - number of recommendations to make for each request

  • datasets – str or False - if not False, a comma-separated list of datasets to turn the AI on for at startup

  • use_knowledgebase – Boolean - if true, bootstrap the recommenders with the knowledgebase
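
A minimal construction sketch based on the signature above (a sketch, assuming a running Aliro API server; the api_path value, dataset names, and the polling comments are illustrative assumptions):

    from ai.ai import AI
    from ai.recommender.random_recommender import RandomRecommender

    agent = AI(
        rec_class=RandomRecommender,       # recommender to use
        api_path='http://localhost:5080',  # lab API server (assumed address)
        user='testuser',
        n_recs=3,                          # recommendations per request
        warm_start=False,                  # do not reload a previous AI state
        datasets='iris,diabetes',          # datasets to turn the AI on for at startup
    )

    # A typical polling cycle using the methods documented below:
    if agent.check_requests():
        pass  # new requests were added to agent.request_queue
    if agent.check_results():
        agent.update_recommender()  # train recommenders on the new results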

check_requests()[source]

Check to see if any new AI requests have been submitted. If so, add them to self.request_queue.

Returns

Boolean - True if new AI requests have been submitted

check_results()[source]

Checks to see if new experiment results have been posted since the previous time step. If so, set them to self.new_data and return True.

Returns

Boolean - True if new results were found

generate_recommendations(datasetId, numOfRecs)[source]

Generate ML recommendation payloads for the given dataset.

Parameters
  • datasetId – ID of the dataset to generate recommendations for

  • numOfRecs – number of recommendations to generate

Returns

list of maps that represent request payload objects

get_results_metafeatures(results_data)[source]

Return a pandas dataframe of metafeatures associated with the datasets in results_data.

Retrieves metafeatures from self.dataset_mf_cache if they exist, otherwise queries the API and updates the cache.

Parameters

results_data – experiment results with associated datasets

initialize_recommenders(rec_class)[source]

Initialize the classification and regression recommenders.

load_knowledgebase()[source]

Bootstrap the recommenders with the knowledgebase.

load_state()[source]

Loads pickled score file and recommender model.

TODO: test that this still works

save_state()[source]

Save ML+P scores to a pickle file or to the DB.

TODO: test that this still works

transfer_rec(rec_payload)[source]

Attempt to send a recommendation to the lab server. If any error other than a no-capacity error occurs, throw an exception.

Parameters

rec_payload – dictionary - the payload describing the experiment

Returns

bool - true if successfully sent, false if no machine capacity is available

update_recommender()[source]

Update recommender models based on new experiment results in self.new_data, and then clear self.new_data.

Recommenders

Base Recommender

class ai.recommender.base.BaseRecommender(ml_type='classifier', metric=None, ml_p=None, random_state=None, knowledgebase_results=None, knowledgebase_metafeatures=None, load_serialized_rec='if_exists', serialized_rec_directory=None, serialized_rec_filename=None)[source]

Base recommender for Aliro

The BaseRecommender is not intended to be used directly; it is a skeleton class defining the interface for future recommenders within the Aliro project.

Parameters
  • ml_type (str, 'classifier' or 'regressor') – Recommending classifiers or regressors. Used to determine ML options.

  • metric (str (default: accuracy for classifiers, mse for regressors)) – The metric by which to assess performance on the datasets.

  • ml_p (DataFrame (default: None)) – Contains all valid ML parameter combos, with columns ‘algorithm’ and ‘parameters’

  • knowledgebase_results (Pandas DataFrame or None) – Initial knowledgebase results data. If not None and not loading a serialized recommender, the recommender will initialize and train on this data. If loading a serialized recommender, this is the knowledgebase that accompanies it.

  • knowledgebase_metafeatures (Pandas DataFrame or None) – Initial knowledgebase metafeatures data. If loading a serialized recommender, this is the knowledgebase that accompanies it.

  • serialized_rec_directory (string or None) – Name of the directory to save/load a serialized recommender. Default directory is “.”

  • serialized_rec_filename (string or None) – Name of the file to save/load a serialized recommender. If the filename is not provided, a default filename is generated based on the recommender type, metric, and knowledgebase used.

  • load_serialized_rec (str, "always", "never", "if_exists") –

    Whether to attempt to load a serialized recommender:

    “if_exists” - If a serialized recommender exists at the specified path, load it.

    “always” - Always load a serialized recommender; throw an exception if no serialized recommender exists.

    “never” - Never load a serialized recommender.

load(filename=None, knowledgebase=None)[source]

Load a saved recommender state.

Parameters
  • filename – string or None - Name of the file to load

  • knowledgebase –

    string or None - knowledgebase data with columns corresponding to:

    ‘dataset’, ‘algorithm’, ‘parameters’, self.metric

recommend(dataset_id=None, n_recs=1, dataset_mf=None)[source]

Return a model and parameter values expected to do best on dataset.

Parameters
  • dataset_id (string) – ID of the dataset for which the recommender is generating recommendations.

  • n_recs (int (default: 1), optional) – Return a list of length n_recs in order of estimators and parameters expected to do best.

  • dataset_mf (DataFrame) – metafeatures of the dataset represented by dataset_id

save(filename=None)[source]

Save the current recommender.

Parameters

filename – string or None - Name of the file to save to

update(results_data, results_mf=None, source='pennai')[source]

Update ML / Parameter recommendations.

Parameters
  • results_data (DataFrame) – columns corresponding to: ‘algorithm’ ‘parameters’ self.metric

  • results_mf (DataFrame, optional) – columns corresponding to metafeatures of each dataset in results_data.

  • source (string) – if ‘pennai’, will update tally of trained dataset models

update_and_save(results_data, results_mf=None, source='pennai', filename=None)[source]

Runs self.update() and then self.save().

Parameters
  • results_data (DataFrame) – columns corresponding to: ‘algorithm’ ‘parameters’ self.metric

  • results_mf (DataFrame, optional) – columns corresponding to metafeatures of each dataset in results_data.

  • source (string) – if ‘pennai’, will update tally of trained dataset models
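
To make the interface concrete, a sketch with a toy results_data frame (all values are invented; the metric column must be named after self.metric, here ‘accuracy’):

    import pandas as pd

    # Toy results with the columns listed above; all values are invented.
    results_data = pd.DataFrame({
        'dataset':    ['d1', 'd1', 'd2'],
        'algorithm':  ['DecisionTreeClassifier', 'LogisticRegression',
                       'DecisionTreeClassifier'],
        'parameters': [str({'max_depth': 3}), str({'C': 1.0}),
                       str({'max_depth': 5})],
        'accuracy':   [0.81, 0.74, 0.88],  # column named after self.metric
    })

    # With any concrete recommender `rec` (BaseRecommender itself is a skeleton):
    # rec.update(results_data)                        # absorb the new results
    # recs = rec.recommend(dataset_id='d3', n_recs=5)
    # rec.update_and_save(results_data)               # update, then serialize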

Random Recommender

class ai.recommender.random_recommender.RandomRecommender(ml_type='classifier', metric=None, ml_p=None, random_state=None, knowledgebase_results=None, knowledgebase_metafeatures=None, load_serialized_rec='if_exists', serialized_rec_directory=None, serialized_rec_filename=None)[source]

Aliro random recommender.

Recommends random machine learning algorithms and parameters from the possible algorithms fetched from the server.

Parameters
  • ml_type (str, 'classifier' or 'regressor') – Recommending classifiers or regressors. Used to determine ML options.

  • metric (str (default: accuracy for classifiers, mse for regressors)) – The metric by which to assess performance on the datasets.

  • ml_p (DataFrame) – Contains all the algorithm / parameter combinations available for recommendation.
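
A usage sketch, assuming an ml_p frame of valid algorithm/parameter combinations (the combinations shown are invented; in deployment they are fetched from the server):

    import pandas as pd
    from ai.recommender.random_recommender import RandomRecommender

    ml_p = pd.DataFrame({
        'algorithm':  ['DecisionTreeClassifier', 'DecisionTreeClassifier',
                       'LogisticRegression'],
        'parameters': [str({'max_depth': 3}), str({'max_depth': 5}),
                       str({'C': 1.0})],
    })

    rec = RandomRecommender(ml_type='classifier', ml_p=ml_p,
                            random_state=42, load_serialized_rec='never')
    recs = rec.recommend(dataset_id='my-dataset', n_recs=2)  # two random ML+P picks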

recommend(dataset_id=None, n_recs=1, dataset_mf=None)[source]

Return a model and parameter values expected to do best on dataset.

Parameters
  • dataset_id (string) – ID of the dataset for which the recommender is generating recommendations.

  • n_recs (int (default: 1), optional) – Return a list of length n_recs in order of estimators and parameters expected to do best.

update(results_data, results_mf=None, source='pennai')[source]

Update ML / Parameter recommendations.

Parameters
  • results_data (DataFrame) – columns corresponding to: ‘algorithm’ ‘parameters’ self.metric

  • results_mf (DataFrame, optional) – columns corresponding to metafeatures of each dataset in results_data.

Average Recommender

class ai.recommender.average_recommender.AverageRecommender(ml_type='classifier', metric=None, ml_p=None, random_state=None, knowledgebase_results=None, knowledgebase_metafeatures=None, load_serialized_rec='if_exists', serialized_rec_directory=None, serialized_rec_filename=None)[source]

Aliro average recommender.

Recommends machine learning algorithms and parameters based on their average performance across all evaluated datasets.

Parameters
  • ml_type (str, 'classifier' or 'regressor') – Recommending classifiers or regressors. Used to determine ML options.

  • metric (str (default: accuracy for classifiers, mse for regressors)) – The metric by which to assess performance on the datasets.

recommend(dataset_id=None, n_recs=1, dataset_mf=None)[source]

Return a model and parameter values expected to do best on dataset.

Parameters
  • dataset_id (string) – ID of the dataset for which the recommender is generating recommendations.

  • n_recs (int (default: 1), optional) – Return a list of length n_recs in order of estimators and parameters expected to do best.

update(results_data, results_mf=None, source='pennai')[source]

Update ML / Parameter recommendations based on overall performance in results_data.

Updates self.scores

Parameters

results_data (DataFrame) – columns corresponding to: ‘dataset’ ‘algorithm’ ‘parameters’ self.metric
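
The averaging idea in miniature (a hand-rolled sketch of the score tally, not the class’s internal code; all values are invented):

    import pandas as pd

    # Scores are tallied per ML+P combination across all evaluated datasets.
    results = pd.DataFrame({
        'dataset':    ['d1', 'd2', 'd3'],
        'algorithm':  ['DT', 'DT', 'LR'],
        'parameters': ['p1', 'p1', 'p2'],
        'accuracy':   [0.80, 0.90, 0.70],
    })
    avg_scores = results.groupby(['algorithm', 'parameters'])['accuracy'].mean()
    print(avg_scores.sort_values(ascending=False))  # DT/p1 -> 0.85 ranks first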

KNN Recommender

class ai.recommender.knn_meta_recommender.KNNMetaRecommender(ml_type='classifier', metric=None, ml_p=None, random_state=None, knowledgebase_results=None, knowledgebase_metafeatures=None, load_serialized_rec='if_exists', serialized_rec_directory=None, serialized_rec_filename=None)[source]

Aliro KNN meta recommender.

Recommends machine learning algorithms and parameters as follows:
  • store the best ML + P on every dataset.

  • given a new dataset, measure its distance to all results in metafeature space.

  • recommend ML + P with best performance on closest dataset.

Parameters
  • ml_type (str, 'classifier' or 'regressor') – Recommending classifiers or regressors. Used to determine ML options.

  • metric (str (default: accuracy for classifiers, mse for regressors)) – The metric by which to assess performance on the datasets.

  • ml_p (DataFrame (default: None)) – Contains all valid ML parameter combos, with columns ‘algorithm’ and ‘parameters’

all_dataset_mf

Initialize recommendation system.

best_model_prediction(dataset_id, df_mf, n_recs=1)[source]

Predict scores over many variations of ML+P and pick the best.

recommend(dataset_id, n_recs=1, dataset_mf=None)[source]

Return a model and parameter values expected to do best on dataset.

Parameters
  • dataset_id (string) – ID of the dataset for which the recommender is generating recommendations.

  • n_recs (int (default: 1), optional) – Return a list of length n_recs in order of estimators and parameters expected to do best.

update(results_data, results_mf, source='pennai')[source]

Update ML / Parameter recommendations.

Parameters
  • results_data (DataFrame) – columns corresponding to: ‘algorithm’ ‘parameters’ self.metric

  • results_mf (DataFrame) – columns corresponding to metafeatures of each dataset in results_data.

update_model(results_data)[source]

Stores the best ML+P on each dataset.
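
The nearest-dataset lookup in miniature (a hand-rolled sketch of the strategy described above, not the class’s internal code; all values are invented):

    import numpy as np

    # Metafeature vectors and per-dataset best ML+P pairs seen so far.
    mf_store = {'d1': np.array([10.0, 0.30]), 'd2': np.array([200.0, 0.70])}
    best_mlp = {'d1': ('DT', 'p1'), 'd2': ('LR', 'p2')}

    new_mf = np.array([190.0, 0.65])  # metafeatures of the new dataset
    closest = min(mf_store, key=lambda d: np.linalg.norm(mf_store[d] - new_mf))
    print(best_mlp[closest])  # ('LR', 'p2'): the closest dataset's best ML+P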

Surprise Recommenders

We have a customized version of the Surprise library, available at https://github.com/lacava/surprise.

class ai.recommender.surprise_recommenders.SurpriseRecommender(ml_type='classifier', metric=None, ml_p=None, random_state=None, knowledgebase_results=None, knowledgebase_metafeatures=None, load_serialized_rec='if_exists', serialized_rec_directory=None, serialized_rec_filename=None)[source]

Class to support generic recommenders from the Surprise library. Not intended to be used as a standalone class.

Parameters
  • ml_type (str, 'classifier' or 'regressor') – Recommending classifiers or regressors. Used to determine ML options.

  • metric (str (default: accuracy for classifiers, mse for regressors)) – The metric by which to assess performance on the datasets.

load(filename=None, knowledgebase=None)[source]

Load a saved recommender state.

max_epochs

Initialize recommendation system.

recommend(dataset_id, n_recs=1, dataset_mf=None)[source]

Return a model and parameter values expected to do best on dataset.

Parameters
  • dataset_id (string) – ID of the dataset for which the recommender is generating recommendations.

  • n_recs (int (default: 1), optional) – Return a list of length n_recs in order of estimators and parameters expected to do best.

update(results_data, results_mf=None, source='pennai')[source]

Update ML / Parameter recommendations based on overall performance in results_data.

Parameters
  • results_data – DataFrame with columns corresponding to: ‘dataset’ ‘algorithm’ ‘parameters’ self.metric

  • results_mf – metafeatures for the datasets in results_data

class ai.recommender.surprise_recommenders.CoClusteringRecommender(ml_type='classifier', metric=None, ml_p=None, random_state=None, knowledgebase_results=None, knowledgebase_metafeatures=None, load_serialized_rec='if_exists', serialized_rec_directory=None, serialized_rec_filename=None)[source]

Generates recommendations via CoClustering, see https://surprise.readthedocs.io/en/stable/co_clustering.html

class ai.recommender.surprise_recommenders.KNNWithMeansRecommender(ml_type='classifier', metric=None, ml_p=None, random_state=None, knowledgebase_results=None, knowledgebase_metafeatures=None, load_serialized_rec='if_exists', serialized_rec_directory=None, serialized_rec_filename=None)[source]

Generates recommendations via KNNWithMeans, see https://surprise.readthedocs.io/en/stable/knn_inspired.html

class ai.recommender.surprise_recommenders.KNNDatasetRecommender(ml_type='classifier', metric=None, ml_p=None, random_state=None, knowledgebase_results=None, knowledgebase_metafeatures=None, load_serialized_rec='if_exists', serialized_rec_directory=None, serialized_rec_filename=None)[source]

Generates recommendations via KNN with clusters defined over datasets, see https://surprise.readthedocs.io/en/stable/knn_inspired.html

class ai.recommender.surprise_recommenders.KNNMLRecommender(ml_type='classifier', metric=None, ml_p=None, random_state=None, knowledgebase_results=None, knowledgebase_metafeatures=None, load_serialized_rec='if_exists', serialized_rec_directory=None, serialized_rec_filename=None)[source]

Generates recommendations via KNN with clusters defined over algorithms, see https://surprise.readthedocs.io/en/stable/knn_inspired.html

class ai.recommender.surprise_recommenders.SlopeOneRecommender(ml_type='classifier', metric=None, ml_p=None, random_state=None, knowledgebase_results=None, knowledgebase_metafeatures=None, load_serialized_rec='if_exists', serialized_rec_directory=None, serialized_rec_filename=None)[source]

Generates recommendations via SlopeOne, see https://surprise.readthedocs.io/en/stable/slope_one.html

class ai.recommender.surprise_recommenders.SVDRecommender(ml_type='classifier', metric=None, ml_p=None, random_state=None, knowledgebase_results=None, knowledgebase_metafeatures=None, load_serialized_rec='if_exists', serialized_rec_directory=None, serialized_rec_filename=None)[source]

SVD recommender; see https://surprise.readthedocs.io/en/stable/matrix_factorization.html. Recommends machine learning algorithms and parameters using the SVD algorithm.

  • stores ML + P results for every dataset.

  • learns a matrix factorization on the non-missing data.

  • given a dataset, estimates the rankings of all ML+P and returns the top n_recs.

Note that we use a custom online version of SVD found here: https://github.com/lacava/surprise
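
A usage sketch with a toy knowledgebase (column names follow the BaseRecommender docs; the values are invented, and in practice additional arguments such as ml_p may also be needed):

    import pandas as pd
    from ai.recommender.surprise_recommenders import SVDRecommender

    kb = pd.DataFrame({
        'dataset':    ['d1', 'd1', 'd2'],
        'algorithm':  ['DT', 'LR', 'DT'],
        'parameters': ['p1', 'p2', 'p1'],
        'accuracy':   [0.81, 0.74, 0.88],
    })

    rec = SVDRecommender(ml_type='classifier', metric='accuracy',
                         knowledgebase_results=kb, load_serialized_rec='never')
    recs = rec.recommend(dataset_id='d3', n_recs=3)  # top 3 estimated ML+P rankings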

Scikit-learn API for Aliro engine

This is the API for using Aliro engine as a standalone python package.

class ai.sklearn.pennai_sklearn.PennAI(rec_class=None, verbose=0, serialized_rec=None, scoring=None, n_recs=10, n_iters=10, knowledgebase=None, kb_metafeatures=None, config_dict=None, ensemble=None, max_time_mins=None, stopping_criteria=None, random_state=None, n_jobs=1)[source]

Aliro standalone sklearn wrapper.

Responsible for:
  • checking for user requests for recommendations,
  • checking for new results from experiments,
  • calling the recommender system to generate experiment recommendations,
  • posting the recommendations to the API,
  • handling communication with the API.

Parameters
  • rec_class – ai.BaseRecommender - recommender to use

  • verbose – int - 0 quiet, 1 info, 2 debug

  • serialized_rec – string or None - Path of the file to save/load a serialized recommender. If the filename is not provided, a default filename is generated based on the recommender type, metric, and knowledgebase used.

  • scoring – str - scoring for evaluating recommendations

  • n_recs – int - number of recommendations to make for each iteration

  • n_iters – int - total number of iterations

  • knowledgebase – file - input file for knowledgebase

  • kb_metafeatures – file - input file for metafeatures

  • config_dict – python dictionary - hyperparameter search space for all ML algorithms

  • ensemble – if it is an integer N, Aliro will use VotingClassifier/VotingRegressor to ensemble the top N best models into one model.

  • max_time_mins – maximum time in minutes that Aliro can run

  • stopping_criteria – int, optional - Stop recommendations early if the best metric does not improve within this number of iterations.

  • random_state – random state for recommenders

  • n_jobs – int (default: 1) The number of cores to dedicate to computing the scores with joblib. Assigning this parameter to -1 will dedicate as many cores as are available on your system.

fit(X, y)[source]

Trains Aliro on X,y.

Parameters
  • X (array-like {n_samples, n_features}) – Feature matrix of the training set

  • y (ndarray of shape (n_samples,)) – Target of the training set

Returns

self

Return type

object

predict(X)[source]

Predictions for X.

Parameters

X (array-like {n_samples, n_features}) – Feature matrix of the testing set

Returns

y – The predicted target.

Return type

ndarray of shape (n_samples,)

score(X, y)[source]

Return the score on the given testing data using the user-specified scoring function.

Parameters
  • X (array-like {n_samples, n_features}) – Feature matrix of the testing set

  • y (ndarray of shape (n_samples,)) – Target of the testing set

Returns

score – The estimated test set score, computed with the user-specified scoring function

Return type

float

The two classes below can be imported from pennai.sklearn after installing the pennaipy package via pip (see the PennAIpy User Guide).

class ai.sklearn.pennai_sklearn.PennAIClassifier(rec_class=None, verbose=0, serialized_rec=None, scoring=None, n_recs=10, n_iters=10, knowledgebase=None, kb_metafeatures=None, config_dict=None, ensemble=None, max_time_mins=None, stopping_criteria=None, random_state=None, n_jobs=1)[source]

Aliro engine for classification tasks.

Read more in the User Guide of PennAIpy.

Parameters
  • rec_class (ai.recommender.base.BaseRecommender or None) – Recommender to use in the Aliro engine. If None, Aliro will use SVDRecommender by default.

  • verbose (int) – 0 quiet, 1 info, 2 debug

  • serialized_rec (string or None) – Path of the file to save/load a serialized recommender. If the filename is not provided, a default filename is generated based on the recommender type, metric, and knowledgebase used.

  • scoring (str) – scoring for evaluating recommendations. It could be “accuracy”, “balanced_accuracy”, “f1”, “f1_macro”

  • n_recs (int) – number of recommendations to make for each iteration

  • n_iters (int) – total number of iterations

  • knowledgebase (str) – input file for knowledgebase

  • kb_metafeatures (str) – input file for metafeatures

  • config_dict (python dictionary) – dictionary for hyperparameter search space for all ML algorithms

  • ensemble (int) – if it is an integer N, Aliro will use VotingClassifier/VotingRegressor to ensemble the top N best models into one model.

  • max_time_mins – maximum time in minutes that Aliro can run

  • stopping_criteria (int) – Stop recommendations early if the best metric does not improve within this number of iterations.

  • random_state (int) – random state for recommenders

  • n_jobs (int) – The number of cores to dedicate to computing the scores with joblib. Assigning this parameter to -1 will dedicate as many cores as are available on your system.
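
A usage sketch following the scikit-learn conventions documented above (the dataset is illustrative):

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from pennai.sklearn import PennAIClassifier  # per the import note above

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

    pennai = PennAIClassifier(n_recs=5, n_iters=10, random_state=42)
    pennai.fit(X_train, y_train)         # run the recommendation iterations
    print(pennai.score(X_test, y_test))  # test-set score with the chosen scorer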

class ai.sklearn.pennai_sklearn.PennAIRegressor(rec_class=None, verbose=0, serialized_rec=None, scoring=None, n_recs=10, n_iters=10, knowledgebase=None, kb_metafeatures=None, config_dict=None, ensemble=None, max_time_mins=None, stopping_criteria=None, random_state=None, n_jobs=1)[source]

Aliro engine for regression tasks.

Read more in the User Guide of PennAIpy.

Parameters
  • rec_class (ai.recommender.base.BaseRecommender or None) – Recommender to use in the Aliro engine. If None, Aliro will use SVDRecommender by default.

  • verbose (int) – 0 quiet, 1 info, 2 debug

  • serialized_rec (string or None) – Path of the file to save/load a serialized recommender. If the filename is not provided, a default filename is generated based on the recommender type, metric, and knowledgebase used.

  • scoring (str) – scoring for evaluating recommendations. It could be “r2”, “explained_variance”, “neg_mean_squared_error”

  • n_recs (int) – number of recommendations to make for each iteration

  • n_iters (int) – total number of iterations

  • knowledgebase (str) – input file for knowledgebase

  • kb_metafeatures (str) – input file for metafeatures

  • config_dict (python dictionary) – dictionary for hyperparameter search space for all ML algorithms

  • ensemble (int) – if it is an integer N, Aliro will use VotingClassifier/VotingRegressor to ensemble the top N best models into one model.

  • max_time_mins – maximum time in minutes that Aliro can run

  • stopping_criteria (int) – Stop recommendations early if the best metric does not improve within this number of iterations.

  • random_state (int) – random state for recommenders

  • n_jobs (int) – The number of cores to dedicate to computing the scores with joblib. Assigning this parameter to -1 will dedicate as many cores as are available on your system.

Learn

This is the API for building ML models. Aliro uses scikit-learn to achieve this.

IO

These methods control data flow from the server to and from sklearn models.

class machine.learn.io_utils.Experiment(args, basedir='.')[source]
get_input()[source]

Get input data based on experiment ID (_id) from Aliro API.

Returns

input_data – a single pandas.DataFrame (Aliro will use train_test_split to make train/test splits) or a list of two pandas.DataFrame (the first is the training dataset, the second the testing dataset)

Return type

pandas.DataFrame or list of two pandas.DataFrame

get_model()[source]

Build a scikit-learn model based on arguments from the Aliro API.

Returns

  • model (scikit-learn Estimator) – a machine learning model with scikit-learn API

  • method_type (string) – ‘classification’ for a classification model, ‘regression’ for a regression model

machine.learn.io_utils.get_projects()[source]

Get all machine learning algorithms’ information from the Aliro API. This information should match projects.json.

Returns

projects – A dict of all machine learning algorithms’ information

Return type

dict

machine.learn.io_utils.parse_args()[source]

Parse arguments for machine learning algorithm.

Returns

  • args (dict) – Arguments of an experiment from the Aliro API

  • param_grid (dict) – Dictionary with parameter names (string) as keys and lists of parameter settings to try as values, or a list of such dictionaries, in which case the grids spanned by each dictionary in the list are explored. This enables searching over any sequence of parameter settings.

machine.learn.io_utils.get_input_data(_id, tmpdir)[source]

Get input dataset information from Aliro API.

Parameters
  • _id (string) – Experiment ID in Aliro

  • tmpdir (string) – Path of temporary directory

Returns

  • input_data (pandas.DataFrame or list of two pandas.DataFrame) – a single pandas.DataFrame: Aliro will use train_test_split to make train/test splits; a list of two pandas.DataFrame: the first is the training dataset, the second the testing dataset

  • data_info (dict) –

    • target_name: string, target column name

    • filename: list, filename(s)

    • categories: list, categorical feature name(s)

    • ordinals: dict

      • keys: categorical feature name(s)

      • values: categorical values

machine.learn.io_utils.get_file_data(file_id)[source]

Attempt to retrieve the dataset file. If the file is corrupt or an error response is returned, it will raise a ValueError.

Parameters
file_id (string) – File ID from the Aliro database

Returns

string – Dataset string, which will be read by pandas and converted to a pd.DataFrame

machine.learn.io_utils.check_column(column_name, dataframe)[source]

Check if a column exists in a Pandas DataFrame.

Parameters
  • column_name (string) – column name

  • dataframe (pandas.DataFrame) – input dataset DataFrame

Return type

None

machine.learn.io_utils.bool_type(val)[source]

Convert argument to boolean type.

Parameters

val (string) – Value of a parameter in string type

Returns

_ – Converted value in boolean type

Return type

boolean

machine.learn.io_utils.none(val)[source]

Convert a “none” argument to None.

Parameters

val (string) – Value of a parameter in string type

Returns

_ – If the input value is “none”, the function will return None; otherwise it will return the string.

Return type

None or string

machine.learn.io_utils.get_type(param_type)[source]

Return the conversion function for an input type.

Parameters

param_type (string or list) – string: type of a parameter as defined in projects.json; list: list of parameter types (for parameters supporting multiple input types)

Returns

known_types[type] – Function for converting an argument from the Aliro UI for assignment to a scikit-learn estimator

Return type

function
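
To illustrate the conversion contract, a hand-rolled sketch of the bool_type and none behavior described above (the exact mapping is an assumption, not the module’s code):

    # Assumed string-to-value mappings, mirroring the descriptions above.
    def bool_type(val):
        lookup = {'true': True, 'false': False}
        try:
            return lookup[val.lower()]
        except KeyError:
            raise ValueError(f'{val!r} is not a boolean-like string')

    def none(val):
        # "none" becomes Python's None; anything else passes through unchanged.
        return None if val.lower() == 'none' else val

    assert bool_type('True') is True
    assert none('none') is None and none('gini') == 'gini'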

Scikit-learn Utils

These methods generate sklearn models and evaluate them.

machine.learn.skl_utils.balanced_accuracy(y_true, y_pred)[source]

Default scoring function for classification: balanced accuracy. Balanced accuracy computes each class’ accuracy on a per-class basis using a one-vs-rest encoding, then computes an unweighted average of the class accuracies.

Parameters
  • y_true (numpy.ndarray {n_samples}) – True class labels

  • y_pred (numpy.ndarray {n_samples}) – Predicted class labels by the estimator

Returns

fitness – A float value indicating balanced accuracy; 0.5 is as good as chance, and 1.0 is perfect predictive accuracy

Return type

float
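
A sketch of the computation as described (one-vs-rest accuracy per class, then an unweighted mean); the module’s exact implementation may differ:

    import numpy as np

    def balanced_accuracy_sketch(y_true, y_pred):
        # Per class: accuracy of the one-vs-rest binarization; then average.
        y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
        per_class = [np.mean((y_true == c) == (y_pred == c))
                     for c in np.unique(y_true)]
        return float(np.mean(per_class))

    print(balanced_accuracy_sketch([0, 0, 1, 1], [0, 1, 1, 1]))  # 0.75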

machine.learn.skl_utils.generate_results(model, input_data, tmpdir, _id, target_name='class', mode='classification', figure_export=True, random_state=None, filename=['test_dataset'], categories=None, ordinals=None, encoding_strategy='OneHotEncoder', param_grid={})[source]

Generate results from applying a model to a dataset in Aliro.

Parameters
  • model (scikit-learn Estimator) – A machine learning model following scikit-learn API

  • input_data (pandas.DataFrame or list of two pandas.DataFrame) – a single pandas.DataFrame: Aliro will use 10-fold CV to estimate train/test scores; a list of two pandas.DataFrame: the first is the training dataset, the second the testing dataset

  • target_name (string) – Target name in input data

  • tmpdir (string) – Temporary directory for saving experiment results

  • _id (string) – Experiment id

  • mode (string) – ‘classification’: Run classification analysis ‘regression’: Run regression analysis

  • figure_export (boolean) – If figure_export is True, the figures will be generated and exported.

  • random_state (int) – Random seed

  • filename (list) – Filename for input dataset

  • categories (list) – List of column names for one-hot encoding

  • ordinals (dict) –

    Dictionary of ordinal features:

    keys: categorical feature name(s); values: categorical values

  • encoding_strategy (string) – Encoding strategy for categorical features defined in projects.json

  • param_grid (dict) – If param_grid is a non-empty dictionary, the experiment will do parameter tuning via GridSearchCV, report the best result to the UI, and save all results to the knowledge base.

Return type

None

machine.learn.skl_utils.get_col_idx(feature_names_list, columns)[source]

Get unique indexes of columns based on a list of column names.

Parameters
  • feature_names_list (list) – List of column names on dataset

  • columns (list) – List of selected column names

Returns

col_idx – list of selected column indexes

Return type

list

machine.learn.skl_utils.setup_model_params(model, parameter_name, value)[source]

Assign a value to a parameter in a model.

Parameters
  • model (scikit-learn Estimator) – Machine learning model following scikit-learn API

  • parameter_name (string) – Parameter name in the scikit-learn model

  • value (object) – Value for assigning to the parameter

Returns

model – A new scikit-learn model with an updated parameter

Return type

scikit-learn Estimator

machine.learn.skl_utils.compute_imp_score(model, metric, features, target, random_state)[source]

Compute permutation importance scores for features.

Parameters
  • model (scikit-learn Estimator) – A fitted scikit-learn model

  • metric (str, callable) – The metric for evaluating the feature importance through permutation. By default, the string ‘accuracy’ is recommended for classifiers and the string ‘r2’ is recommended for regressors. Optionally, a custom scoring function (e.g., metric=scoring_func) that accepts two arguments, y_true and y_pred, which have a similar shape to the y array.

  • features (np.darray/pd.DataFrame) – Features in training dataset

  • target (np.darray/pd.DataFrame) – Target in training dataset

  • random_state (int) – Random seed for permutation importances

Returns

  • coefs (np.darray) – Feature importance scores

  • imp_score_type (string) – Importance score type
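
The permutation idea in miniature, using scikit-learn’s generic permutation_importance (the same principle, not this module’s code; the dataset and model are illustrative):

    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.inspection import permutation_importance

    X, y = load_iris(return_X_y=True)
    model = RandomForestClassifier(random_state=0).fit(X, y)

    # Shuffle one feature at a time and measure the drop in the chosen metric.
    result = permutation_importance(model, X, y, scoring='accuracy',
                                    n_repeats=5, random_state=0)
    print(result.importances_mean)  # one importance score per feature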

machine.learn.skl_utils.save_json_fmt(outdir, _id, fname, content)[source]

Save results into json format.

Parameters
  • outdir (string) – Path of output directory

  • _id (string) – Experiment ID in Aliro

  • fname (string) – File name

  • content (list or dictionary) – Content of the results

Return type

None

machine.learn.skl_utils.plot_confusion_matrix(tmpdir, _id, X, y, class_names, cv_scores, figure_export)[source]

Make plot for confusion matrix.

Parameters
  • tmpdir (string) – Temporary directory for saving experiment results

  • _id (string) – Experiment ID in Aliro

  • X (np.darray/pd.DataFrame) – Features in training dataset

  • y (np.darray/pd.DataFrame) – Target in training dataset

  • class_names (list) – List of class names

  • cv_scores (dictionary) – Return from sklearn.model_selection.cross_validate

  • figure_export (boolean) – If true, then export the confusion matrix plot

Return type

None

machine.learn.skl_utils.plot_learning_curve(tmpdir, _id, model, features, target, cv, return_times=True)[source]

Make learning curve.

Parameters
  • tmpdir (string) – Temporary directory for saving experiment results

  • _id (string) – Experiment ID in Aliro

  • model (user specified model) –

  • features (np.darray/pd.DataFrame) – Features in training dataset

  • target (np.darray/pd.DataFrame) – Target in training dataset

  • cv (int, cross-validation generator or an iterable) –

Return type

None

machine.learn.skl_utils.plot_pca_2d(tmpdir, _id, features, target)[source]

Make a 2D PCA plot.

Parameters
  • tmpdir (string) – Temporary directory for saving 2d pca plot and json file

  • _id (string) – Experiment ID in Aliro

  • features (np.darray/pd.DataFrame) – Features in training dataset

  • target (np.darray/pd.DataFrame) – Target in training dataset

Return type

None

machine.learn.skl_utils.plot_tsne_2d(tmpdir, _id, features, target)[source]

Make a 2D t-SNE plot.

Parameters
  • tmpdir (string) – Temporary directory for saving 2d t-sne plot and json file

  • _id (string) – Experiment ID in Aliro

  • features (np.darray/pd.DataFrame) – Features in training dataset

  • target (np.darray/pd.DataFrame) – Target in training dataset

Return type

None

machine.learn.skl_utils.plot_roc_curve(tmpdir, _id, X, y, cv_scores, figure_export)[source]

Plot the ROC curve.

Parameters
  • tmpdir (string) – Temporary directory for saving experiment results

  • _id (string) – Experiment ID in Aliro

  • X (np.darray/pd.DataFrame) – Features in training dataset

  • y (np.darray/pd.DataFrame) – Target in training dataset

  • cv_scores (dictionary) – Return from sklearn.model_selection.cross_validate

  • figure_export (boolean) – If true, then export the ROC curve plot

Return type

None

machine.learn.skl_utils.plot_imp_score(tmpdir, _id, coefs, feature_names, imp_score_type)[source]

Plot importance scores for features.

Parameters
  • tmpdir (string) – Temporary directory for saving experiment results

  • _id (string) – Experiment ID in Aliro

  • coefs (array) – Feature importance scores

  • feature_names (np.array) – List of feature names

  • imp_score_type (string) – Importance score type

Returns

  • top_features (list) – Top features with high importance score

  • indices (ndarray) – Array of indices of top important features

machine.learn.skl_utils.plot_dot_plot(tmpdir, _id, features, target, top_features_name, indices, random_state, mode)[source]

Make a dot plot based on a decision tree.

Parameters
  • tmpdir (string) – Temporary directory for saving experiment results

  • _id (string) – Experiment ID in Aliro

  • features (np.darray/pd.DataFrame) – Features in training dataset

  • target (np.darray/pd.DataFrame) – Target in training dataset

  • top_features_name (list) – Top feature names

  • indices (ndarray) – Array of indices of top important features

  • random_state (int) – Random seed for permuation importances

  • mode (string) – ‘classification’: Run classification analysis ‘regression’: Run regression analysis

Returns

dtree_train_score – Test score from fitting a decision tree on the top important features

Return type

float

machine.learn.skl_utils.export_model(tmpdir, _id, model, filename, target_name, mode='classification', random_state=42)[source]

Export the model as a pickle file and generate a script for using the pickled model.

Parameters
  • tmpdir (string) – Temporary directory for saving experiment results

  • _id (string) – Experiment ID in Aliro

  • model (scikit-learn estimator) – A fitted scikit-learn model

  • filename (string) – File name of input dataset

  • target_name (string) – Target name in input data

  • mode (string) – ‘classification’: Run classification analysis ‘regression’: Run regression analysis

  • random_state (int) – Random seed in model

Return type

None
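
What consuming the exported pickle might look like (a sketch; the real loading script is produced by generate_export_codes below, and the file and target names here are illustrative):

    import pickle

    import pandas as pd

    with open('model_42.pkl', 'rb') as f:  # illustrative file name
        model = pickle.load(f)

    df = pd.read_csv('test_dataset.csv')   # illustrative dataset file
    X = df.drop(columns=['class'])         # 'class' is the assumed target_name
    print(model.predict(X)[:5])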

machine.learn.skl_utils.generate_export_codes(pickle_file_name, model, filename, target_name, mode='classification', random_state=42)[source]

Generate all library import calls for use in stand-alone Python scripts.

Parameters
  • pickle_file_name (string) – Pickle file name for a fitted scikit-learn estimator

  • model (scikit-learn estimator) – A fitted scikit-learn model

  • filename (string) – File name of input dataset

  • target_name (string) – Target name in input data

  • mode (string) – ‘classification’: Run classification analysis ‘regression’: Run regression analysis

  • random_state (int) – Random seed in model

Returns

pipeline_text – The Python script for applying the current optimized pipeline in a stand-alone Python environment

Return type

String