class, api_path=None, extra_payload={}, user='testuser', rec_score_file='rec_state.obj', verbose=True, warm_start=False, n_recs=1, datasets=False, use_knowledgebase=False, term_condition='n_recs', max_time=5)[source]

AI managing agent for Aliro.

Responsible for checking for user requests for recommendations, checking for new results from experiments, calling the recommender system to generate experiment recommendations, posting the recommendations to the API, and handling communication with the API.

  • rec_class – ai.BaseRecommender - recommender to use

  • api_path – string - path to the lab api server

  • extra_payload – dict - any additional payload that needs to be specified

  • user – string - test user

  • rec_score_file – file - pickled score file to keep persistent scores between sessions

  • verbose – Boolean

  • warm_start – Boolean - if true, attempt to load the ai state from the file provided by rec_score_file

  • n_recs – int - number of recommendations to make for each request

  • datasets – str or False - if not False, a comma-separated list of datasets to turn the AI on for at startup

  • use_knowledgebase – Boolean


Check to see if any new AI requests have been submitted. If so, add them to self.request_queue.


Boolean - True if new AI requests have been submitted


Checks to see if new experiment results have been posted since the previous time step. If so, set them to self.new_data and return True.


Boolean - True if new results were found

generate_recommendations(datasetId, numOfRecs)[source]

Generate ml recommendation payloads for the given dataset.

  • datasetId

  • numOfRecs

Returns a list of maps that represent request payload objects.


Return a pandas dataframe of metafeatures associated with the datasets in results_data.

Retrieves metafeatures from self.dataset_mf_cache if they exist; otherwise queries the API and updates the cache.


results_data – experiment results with associated datasets


Initialize classification and regression recommenders.


Bootstrap the recommenders with the knowledgebase.


Loads pickled score file and recommender model.

TODO: test that this still works


Save ML+P scores in pickle or to DB

TODO: test that this still works


Attempt to send a recommendation to the lab server. If any error other than a no-capacity error occurs, raise an exception.


rec_payload – dictionary - the payload describing the experiment

:return bool - true if successfully sent, false if no machine capacity available


Update recommender models based on new experiment results in self.new_data, and then clear self.new_data.


Base Recommender

class ai.recommender.base.BaseRecommender(ml_type='classifier', metric=None, ml_p=None, random_state=None, knowledgebase_results=None, knowledgebase_metafeatures=None, load_serialized_rec='if_exists', serialized_rec_directory=None, serialized_rec_filename=None)[source]

Base recommender for Aliro

The BaseRecommender is not intended to be used directly; it is a skeleton class defining the interface for future recommenders within the Aliro project.

  • ml_type (str, 'classifier' or 'regressor') – Recommending classifiers or regressors. Used to determine ML options.

  • metric (str (default: accuracy for classifiers, mse for regressors)) – The metric by which to assess performance on the datasets.

  • ml_p (DataFrame (default: None)) – Contains all valid ML parameter combos, with columns ‘algorithm’ and ‘parameters’

  • knowledgebase_results (Pandas DataFrame or None) – Initial knowledgebase results data. If not None and not loading a serialized recommender, the recommender will initialize and train on this data. If loading a serialized recommender, this is the knowledgebase that accompanies it.

  • knowledgebase_metafeatures (Pandas DataFrame or None) – Initial knowledgebase metafeatures data. If loading a serialized recommender, this is the knowledgebase that accompanies it.

  • serialized_rec_directory (string or None) – Name of the directory to save/load a serialized recommender. Default directory is “.”

  • serialized_rec_filename (string or None) – Name of the file to save/load a serialized recommender. If the filename is not provided, a default filename based on the recommender type, metric, and knowledgebase is used.

  • load_serialized_rec (str, "always", "never", "if_exists") –

    Whether to attempt to load a serialized recommender:

    “if_exists” - If a serialized recommender exists at the specified path, load it.

    “always” - Always load a serialized recommender. Throw an exception if no serialized recommender exists.

    “never” - Never load a serialized recommender.
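The three load_serialized_rec policies can be read as a small decision function. The following is a minimal sketch, not the Aliro implementation; the helper name should_load and the choice of FileNotFoundError are assumptions made for illustration.

```python
import os


def should_load(load_serialized_rec, path):
    """Return True if a serialized recommender at `path` should be loaded.

    Hypothetical helper: the name and the exception types are illustrative.
    """
    if load_serialized_rec not in ("always", "never", "if_exists"):
        raise ValueError(f"unknown load_serialized_rec: {load_serialized_rec!r}")
    if load_serialized_rec == "never":
        return False
    exists = os.path.isfile(path)
    if load_serialized_rec == "always" and not exists:
        # "always" is strict: a missing file is an error, not a fallback.
        raise FileNotFoundError(f"no serialized recommender at {path}")
    return exists
```

Under this reading, "if_exists" is the forgiving default: it falls back to training from the knowledgebase when no file is found, while "always" fails fast.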

load(filename=None, knowledgebase=None)[source]

Load a saved recommender state.

  • filename – string or None - Name of file to load

  • knowledgebase – string or None - DataFrame with columns corresponding to ‘dataset’, ‘algorithm’, ‘parameters’, and self.metric

recommend(dataset_id=None, n_recs=1, dataset_mf=None)[source]

Return a model and parameter values expected to do best on dataset.

  • dataset_id (string) – ID of the dataset for which the recommender is generating recommendations.

  • n_recs (int (default: 1), optional) – Return a list of length n_recs in order of estimators and parameters expected to do best.

  • dataset_mf (DataFrame) – metafeatures of the dataset represented by dataset_id


Save the current recommender.


filename – string or None - Name of the file to save to

update(results_data, results_mf=None, source='pennai')[source]

Update ML / Parameter recommendations.

  • results_data (DataFrame) – columns corresponding to: ‘algorithm’ ‘parameters’ self.metric

  • results_mf (DataFrame, optional) – columns corresponding to metafeatures of each dataset in results_data.

  • source (string) – if ‘pennai’, will update tally of trained dataset models

update_and_save(results_data, results_mf=None, source='pennai', filename=None)[source]

Runs self.update() and then saves the recommender.

  • results_data (DataFrame) – columns corresponding to: ‘algorithm’ ‘parameters’ self.metric

  • results_mf (DataFrame, optional) – columns corresponding to metafeatures of each dataset in results_data.

  • source (string) – if ‘pennai’, will update tally of trained dataset models

Random Recommender

class ai.recommender.random_recommender.RandomRecommender(ml_type='classifier', metric=None, ml_p=None, random_state=None, knowledgebase_results=None, knowledgebase_metafeatures=None, load_serialized_rec='if_exists', serialized_rec_directory=None, serialized_rec_filename=None)[source]

Aliro random recommender.

Recommends random machine learning algorithms and parameters from the possible algorithms fetched from the server.

  • ml_type (str, 'classifier' or 'regressor') – Recommending classifiers or regressors. Used to determine ML options.

  • metric (str (default: accuracy for classifiers, mse for regressors)) – The metric by which to assess performance on the datasets.

  • ml_p (DataFrame) – Contains all the machine learning / algorithm combinations available for recommendation.

recommend(dataset_id=None, n_recs=1, dataset_mf=None)[source]

Return a model and parameter values expected to do best on dataset.

  • dataset_id (string) – ID of the dataset for which the recommender is generating recommendations.

  • n_recs (int (default: 1), optional) – Return a list of length n_recs in order of estimators and parameters expected to do best.
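As a rough illustration of the random strategy, the sketch below draws n_recs distinct algorithm/parameter combinations without looking at any results. It is a simplification under stated assumptions: ml_p is modeled as a plain list of (algorithm, parameters) tuples rather than a DataFrame, and recommend_random is a hypothetical name.

```python
import random


def recommend_random(ml_p, n_recs=1, random_state=None):
    """Pick n_recs distinct (algorithm, parameters) pairs uniformly at random."""
    rng = random.Random(random_state)
    # Never ask for more recommendations than there are combinations.
    n = min(n_recs, len(ml_p))
    return rng.sample(ml_p, n)


# Illustrative stand-in for the ml_p table of valid combinations.
ml_p = [
    ("LogisticRegression", "C=1.0"),
    ("LogisticRegression", "C=10.0"),
    ("DecisionTreeClassifier", "max_depth=3"),
    ("DecisionTreeClassifier", "max_depth=5"),
]
recs = recommend_random(ml_p, n_recs=2, random_state=42)
```

Passing the same random_state reproduces the same draw, which is useful when comparing this baseline against the other recommenders.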

update(results_data, results_mf=None, source='pennai')[source]

Update ML / Parameter recommendations.

  • results_data (DataFrame) – columns corresponding to: ‘algorithm’ ‘parameters’ self.metric

  • results_mf (DataFrame, optional) – columns corresponding to metafeatures of each dataset in results_data.

Average Recommender

class ai.recommender.average_recommender.AverageRecommender(ml_type='classifier', metric=None, ml_p=None, random_state=None, knowledgebase_results=None, knowledgebase_metafeatures=None, load_serialized_rec='if_exists', serialized_rec_directory=None, serialized_rec_filename=None)[source]

Aliro average recommender.

Recommends machine learning algorithms and parameters based on their average performance across all evaluated datasets.

  • ml_type (str, 'classifier' or 'regressor') – Recommending classifiers or regressors. Used to determine ML options.

  • metric (str (default: accuracy for classifiers, mse for regressors)) – The metric by which to assess performance on the datasets.

recommend(dataset_id=None, n_recs=1, dataset_mf=None)[source]

Return a model and parameter values expected to do best on dataset.

  • dataset_id (string) – ID of the dataset for which the recommender is generating recommendations.

  • n_recs (int (default: 1), optional) – Return a list of length n_recs in order of estimators and parameters expected to do best.

update(results_data, results_mf=None, source='pennai')[source]

Update ML / Parameter recommendations based on overall performance in results_data.

Updates self.scores


results_data (DataFrame) – columns corresponding to: ‘dataset’ ‘algorithm’ ‘parameters’ self.metric
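The averaging strategy can be sketched with a running sum and count per algorithm/parameter combination. This is an illustrative simplification, not the Aliro code: results are plain dicts instead of DataFrame rows, the metric is stored under a hypothetical "score" key, and higher scores are assumed to be better (for an error metric such as mse the sort order would be reversed).

```python
from collections import defaultdict


def update_scores(scores, results):
    """Accumulate [sum, count] of the metric for each (algorithm, parameters)."""
    for row in results:
        key = (row["algorithm"], row["parameters"])
        scores[key][0] += row["score"]
        scores[key][1] += 1
    return scores


def recommend_average(scores, n_recs=1):
    """Return the n_recs combinations with the highest mean score."""
    ranked = sorted(scores, key=lambda k: scores[k][0] / scores[k][1], reverse=True)
    return ranked[:n_recs]


scores = defaultdict(lambda: [0.0, 0])
update_scores(scores, [
    {"dataset": "d1", "algorithm": "A", "parameters": "p1", "score": 0.9},
    {"dataset": "d2", "algorithm": "A", "parameters": "p1", "score": 0.7},
    {"dataset": "d1", "algorithm": "B", "parameters": "p1", "score": 0.85},
    {"dataset": "d2", "algorithm": "B", "parameters": "p1", "score": 0.85},
    {"dataset": "d1", "algorithm": "C", "parameters": "p1", "score": 0.6},
])
top_two = recommend_average(scores, n_recs=2)
```

Keeping sums and counts rather than means lets update() be called repeatedly as new experiment results arrive.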

KNN Recommender

class ai.recommender.knn_meta_recommender.KNNMetaRecommender(ml_type='classifier', metric=None, ml_p=None, random_state=None, knowledgebase_results=None, knowledgebase_metafeatures=None, load_serialized_rec='if_exists', serialized_rec_directory=None, serialized_rec_filename=None)[source]

Aliro KNN meta recommender.

Recommends machine learning algorithms and parameters as follows:
  • store the best ML + P on every dataset.

  • given a new dataset, measure its distance to all results in metafeature space.

  • recommend ML + P with best performance on closest dataset.

  • ml_type (str, 'classifier' or 'regressor') – Recommending classifiers or regressors. Used to determine ML options.

  • metric (str (default: accuracy for classifiers, mse for regressors)) – The metric by which to assess performance on the datasets.

  • ml_p (DataFrame (default: None)) – Contains all valid ML parameter combos, with columns ‘algorithm’ and ‘parameters’
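The three steps listed above can be sketched as follows. This is a minimal illustration under assumptions, not the Aliro implementation: metafeatures are plain numeric tuples, distance is Euclidean, and the names store_best and recommend_knn are hypothetical.

```python
import math


def store_best(results):
    """Keep the best-scoring result row for each dataset."""
    best = {}
    for row in results:
        current = best.get(row["dataset"])
        if current is None or row["score"] > current["score"]:
            best[row["dataset"]] = row
    return best


def recommend_knn(new_mf, known_mf, best, n_recs=1):
    """Recommend the best ML+P of the datasets closest in metafeature space."""
    # Rank known datasets by Euclidean distance to the new dataset.
    ranked = sorted(known_mf, key=lambda d: math.dist(new_mf, known_mf[d]))
    recs = []
    for dataset in ranked:
        rec = (best[dataset]["algorithm"], best[dataset]["parameters"])
        if rec not in recs:
            recs.append(rec)
        if len(recs) == n_recs:
            break
    return recs


results = [
    {"dataset": "d1", "algorithm": "SVC", "parameters": "C=1", "score": 0.80},
    {"dataset": "d1", "algorithm": "RF", "parameters": "n=100", "score": 0.90},
    {"dataset": "d2", "algorithm": "SVC", "parameters": "C=10", "score": 0.95},
]
known_mf = {"d1": (100.0, 5.0), "d2": (10000.0, 50.0)}
best = store_best(results)
recs = recommend_knn((120.0, 6.0), known_mf, best, n_recs=2)
```

In practice metafeatures live on very different scales, so some form of normalization before computing distances is likely needed; the sketch omits it for brevity.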


Initialize recommendation system.

best_model_prediction(dataset_id, df_mf, n_recs=1)[source]

Predict scores over many variations of ML+P and pick the best

recommend(dataset_id, n_recs=1, dataset_mf=None)[source]

Return a model and parameter values expected to do best on dataset.

  • dataset_id (string) – ID of the dataset for which the recommender is generating recommendations.

  • n_recs (int (default: 1), optional) – Return a list of length n_recs in order of estimators and parameters expected to do best.

update(results_data, results_mf, source='pennai')[source]

Update ML / Parameter recommendations.

  • results_data (DataFrame) – columns corresponding to: ‘algorithm’ ‘parameters’ self.metric

  • results_mf (DataFrame) – columns corresponding to metafeatures of each dataset in results_data.


Stores best ML-P on each dataset.

Surprise Recommenders

We have a customized version of the Surprise library available here.

class ai.recommender.surprise_recommenders.SurpriseRecommender(ml_type='classifier', metric=None, ml_p=None, random_state=None, knowledgebase_results=None, knowledgebase_metafeatures=None, load_serialized_rec='if_exists', serialized_rec_directory=None, serialized_rec_filename=None)[source]

Class to support generic recommenders from the Surprise library. Not intended to be used as a standalone class.

  • ml_type (str, 'classifier' or 'regressor') – Recommending classifiers or regressors. Used to determine ML options.

  • metric (str (default: accuracy for classifiers, mse for regressors)) – The metric by which to assess performance on the datasets.

load(filename=None, knowledgebase=None)[source]

Load a saved recommender state.


Initialize recommendation system.

recommend(dataset_id, n_recs=1, dataset_mf=None)[source]

Return a model and parameter values expected to do best on dataset.

  • dataset_id (string) – ID of the dataset for which the recommender is generating recommendations.

  • n_recs (int (default: 1), optional) – Return a list of length n_recs in order of estimators and parameters expected to do best.

update(results_data, results_mf=None, source='pennai')[source]

Update ML / Parameter recommendations based on overall performance in results_data.

  • results_data – DataFrame with columns corresponding to: ‘dataset’ ‘algorithm’ ‘parameters’ self.metric

  • results_mf – metafeatures for the datasets in results_data

class ai.recommender.surprise_recommenders.CoClusteringRecommender(ml_type='classifier', metric=None, ml_p=None, random_state=None, knowledgebase_results=None, knowledgebase_metafeatures=None, load_serialized_rec='if_exists', serialized_rec_directory=None, serialized_rec_filename=None)[source]

Generates recommendations via CoClustering, see

class ai.recommender.surprise_recommenders.KNNWithMeansRecommender(ml_type='classifier', metric=None, ml_p=None, random_state=None, knowledgebase_results=None, knowledgebase_metafeatures=None, load_serialized_rec='if_exists', serialized_rec_directory=None, serialized_rec_filename=None)[source]

Generates recommendations via KNNWithMeans, see

class ai.recommender.surprise_recommenders.KNNDatasetRecommender(ml_type='classifier', metric=None, ml_p=None, random_state=None, knowledgebase_results=None, knowledgebase_metafeatures=None, load_serialized_rec='if_exists', serialized_rec_directory=None, serialized_rec_filename=None)[source]

Generates recommendations via KNN with clusters defined over datasets, see

class ai.recommender.surprise_recommenders.KNNMLRecommender(ml_type='classifier', metric=None, ml_p=None, random_state=None, knowledgebase_results=None, knowledgebase_metafeatures=None, load_serialized_rec='if_exists', serialized_rec_directory=None, serialized_rec_filename=None)[source]

Generates recommendations via KNN with clusters defined over algorithms, see

class ai.recommender.surprise_recommenders.SlopeOneRecommender(ml_type='classifier', metric=None, ml_p=None, random_state=None, knowledgebase_results=None, knowledgebase_metafeatures=None, load_serialized_rec='if_exists', serialized_rec_directory=None, serialized_rec_filename=None)[source]

Generates recommendations via SlopeOne, see

class ai.recommender.surprise_recommenders.SVDRecommender(ml_type='classifier', metric=None, ml_p=None, random_state=None, knowledgebase_results=None, knowledgebase_metafeatures=None, load_serialized_rec='if_exists', serialized_rec_directory=None, serialized_rec_filename=None)[source]

SVD recommender. see Recommends machine learning algorithms and parameters using the SVD algorithm.

  • stores ML + P and every dataset.

  • learns a matrix factorization on the non-missing data.

  • given a dataset, estimates the rankings of all ML+P and returns the top n_recs.
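To make the matrix-factorization idea concrete, here is a small batch sketch that fits the usual SVD-style estimate (global mean plus dataset bias plus ML+P bias plus a dot product of latent factors) by stochastic gradient descent on the observed scores only. It is not the customized online Surprise implementation; the function names, hyperparameter values, and toy data are all assumptions.

```python
import random


def fit_svd(ratings, n_factors=2, n_epochs=200, lr=0.05, reg=0.02, seed=0):
    """Fit r_hat = mu + b_d + b_m + p_d . q_m by SGD on observed scores.

    `ratings` maps (dataset, mlp) -> score; missing pairs are simply absent.
    """
    rng = random.Random(seed)
    datasets = sorted({d for d, _ in ratings})
    mlps = sorted({m for _, m in ratings})
    mu = sum(ratings.values()) / len(ratings)
    b_d = {d: 0.0 for d in datasets}
    b_m = {m: 0.0 for m in mlps}
    p = {d: [rng.gauss(0, 0.1) for _ in range(n_factors)] for d in datasets}
    q = {m: [rng.gauss(0, 0.1) for _ in range(n_factors)] for m in mlps}
    for _ in range(n_epochs):
        for (d, m), r in ratings.items():
            dot = sum(a * b for a, b in zip(p[d], q[m]))
            err = r - (mu + b_d[d] + b_m[m] + dot)
            b_d[d] += lr * (err - reg * b_d[d])
            b_m[m] += lr * (err - reg * b_m[m])
            for f in range(n_factors):
                pf, qf = p[d][f], q[m][f]
                p[d][f] += lr * (err * qf - reg * pf)
                q[m][f] += lr * (err * pf - reg * qf)
    return {"mu": mu, "b_d": b_d, "b_m": b_m, "p": p, "q": q}


def recommend_svd(model, dataset, n_recs=1):
    """Rank every known ML+P by its estimated score on `dataset`."""
    def estimate(m):
        dot = sum(a * b for a, b in zip(model["p"][dataset], model["q"][m]))
        return model["mu"] + model["b_d"][dataset] + model["b_m"][m] + dot
    return sorted(model["q"], key=estimate, reverse=True)[:n_recs]


ratings = {
    ("d1", "A"): 0.9, ("d1", "B"): 0.5,
    ("d2", "A"): 0.9, ("d2", "B"): 0.5,
    ("d3", "A"): 0.9,  # B was never run on d3, so its score there is estimated
}
model = fit_svd(ratings)
recs = recommend_svd(model, "d3", n_recs=2)
```

The point of the factorization is the last line: the model can rank combinations on a dataset where they were never run, using what it learned from the other datasets.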

Note that we use a custom online version of SVD found here:

Scikit-learn API for Aliro engine

This is the API for using Aliro engine as a standalone python package.

class ai.sklearn.pennai_sklearn.PennAI(rec_class=None, verbose=0, serialized_rec=None, scoring=None, n_recs=10, n_iters=10, knowledgebase=None, kb_metafeatures=None, config_dict=None, ensemble=None, max_time_mins=None, stopping_criteria=None, random_state=None, n_jobs=1)[source]

Aliro standalone sklearn wrapper.

Responsible for checking for user requests for recommendations, checking for new results from experiments, calling the recommender system to generate experiment recommendations, posting the recommendations to the API, and handling communication with the API.

  • rec_class – ai.BaseRecommender - recommender to use

  • verbose – int, 0 quiet, 1 info, 2 debug

  • serialized_rec – string or None - Path of the file to save/load a serialized recommender. If the filename is not provided, a default filename based on the recommender type, metric, and knowledgebase is used.

  • scoring – str - scoring for evaluating recommendations

  • n_recs – int - number of recommendations to make for each iteration

  • n_iters – int - total number of iterations

  • knowledgebase – file - input file for knowledgebase

  • kb_metafeatures – input file for metafeatures

  • config_dict – python dictionary - dictionary for hyperparameter search space for all ML algorithms

  • ensemble – if it is an integer N, Aliro will use VotingClassifier/VotingRegressor to ensemble the top N best models into one model.

  • max_time_mins – maximum time in minutes that Aliro can run

  • stopping_criteria – int, optional - Number of iterations without improvement in the best metric. Recommendations stop early if the best metric does not improve for this many iterations.

  • random_state – random state for recommenders

  • n_jobs – int (default: 1) The number of cores to dedicate to computing the scores with joblib. Assigning this parameter to -1 will dedicate as many cores as are available on your system.
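The interaction between n_iters and stopping_criteria can be pictured with a small loop. This is a sketch of the early-stopping rule as described in the parameter list, not the engine's code; the function name and the idea of feeding it a precomputed list of per-iteration best scores are assumptions, and higher scores are taken to be better.

```python
def run_iterations(scores_per_iter, n_iters=10, stopping_criteria=None):
    """Return (best score, iterations actually run) under early stopping."""
    best = None
    stale = 0  # consecutive iterations without improvement
    iters_run = 0
    for score in scores_per_iter[:n_iters]:
        iters_run += 1
        if best is None or score > best:
            best = score
            stale = 0
        else:
            stale += 1
        if stopping_criteria is not None and stale >= stopping_criteria:
            break
    return best, iters_run


history = [0.70, 0.80, 0.80, 0.79, 0.85]
early = run_iterations(history, n_iters=10, stopping_criteria=2)
full = run_iterations(history, n_iters=10)
```

With stopping_criteria=2 the loop gives up after two stale iterations and never reaches the later 0.85, which shows the trade-off between run time and the chance of a late improvement.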

fit(X, y)[source]

Trains Aliro on X,y.

  • X (array-like {n_samples, n_features}) – Feature matrix of the training set

  • y (ndarray of shape (n_samples,)) – Target of the training set



Return type



Predictions for X.


X (array-like {n_samples, n_features}) – Feature matrix of the testing set


y – The predicted target.

Return type

ndarray of shape (n_samples,)

score(X, y)[source]

Return the score on the given testing data using the user-specified scoring function.

  • X (array-like {n_samples, n_features}) – Feature matrix of the testing set

  • y (ndarray of shape (n_samples,)) – Target of the testing set


accuracy_score – The estimated test set accuracy

Return type


The two classes below can be imported from pennai.sklearn after installing the pennaipy package via pip (see the User Guide of PennAIpy).

class ai.sklearn.pennai_sklearn.PennAIClassifier(rec_class=None, verbose=0, serialized_rec=None, scoring=None, n_recs=10, n_iters=10, knowledgebase=None, kb_metafeatures=None, config_dict=None, ensemble=None, max_time_mins=None, stopping_criteria=None, random_state=None, n_jobs=1)[source]

Aliro engine for classification tasks.

Read more in the User Guide of PennAIpy.

  • rec_class (ai.recommender.base.BaseRecommender or None) – Recommender to use in the Aliro engine. if it is None, Aliro will use SVDRecommender by default.

  • verbose (int) – 0 quiet, 1 info, 2 debug

  • serialized_rec (string or None) – Path of the file to save/load a serialized recommender. If the filename is not provided, a default filename based on the recommender type, metric, and knowledgebase is used.

  • scoring (str) – scoring for evaluating recommendations. It could be “accuracy”, “balanced_accuracy”, “f1”, “f1_macro”

  • n_recs (int) – number of recommendations to make for each iteration

  • n_iters (int) – total number of iterations

  • knowledgebase (str) – input file for knowledgebase

  • kb_metafeatures (str) – input file for metafeatures

  • config_dict (python dictionary) – dictionary for hyperparameter search space for all ML algorithms

  • ensemble (int) – if it is an integer N, Aliro will use VotingClassifier/VotingRegressor to ensemble the top N best models into one model.

  • max_time_mins – maximum time in minutes that Aliro can run

  • stopping_criteria (int) – Number of iterations without improvement in the best metric. Recommendations stop early if the best metric does not improve for this many iterations.

  • random_state (int) – random state for recommenders

  • n_jobs (int) – The number of cores to dedicate to computing the scores with joblib. Assigning this parameter to -1 will dedicate as many cores as are available on your system.

class ai.sklearn.pennai_sklearn.PennAIRegressor(rec_class=None, verbose=0, serialized_rec=None, scoring=None, n_recs=10, n_iters=10, knowledgebase=None, kb_metafeatures=None, config_dict=None, ensemble=None, max_time_mins=None, stopping_criteria=None, random_state=None, n_jobs=1)[source]

Aliro engine for regression tasks.

Read more in the User Guide of PennAIpy.

  • rec_class (ai.recommender.base.BaseRecommender or None) – Recommender to use in the Aliro engine. if it is None, Aliro will use SVDRecommender by default.

  • verbose (int) – 0 quiet, 1 info, 2 debug

  • serialized_rec (string or None) – Path of the file to save/load a serialized recommender. If the filename is not provided, a default filename based on the recommender type, metric, and knowledgebase is used.

  • scoring (str) – scoring for evaluating recommendations. It could be “r2”, “explained_variance”, “neg_mean_squared_error”

  • n_recs (int) – number of recommendations to make for each iteration

  • n_iters (int) – total number of iterations

  • knowledgebase (str) – input file for knowledgebase

  • kb_metafeatures (str) – input file for metafeatures

  • config_dict (python dictionary) – dictionary for hyperparameter search space for all ML algorithms

  • ensemble (int) – if it is an integer N, Aliro will use VotingClassifier/VotingRegressor to ensemble the top N best models into one model.

  • max_time_mins – maximum time in minutes that Aliro can run

  • stopping_criteria (int) – Number of iterations without improvement in the best metric. Recommendations stop early if the best metric does not improve for this many iterations.

  • random_state (int) – random state for recommenders

  • n_jobs (int) – The number of cores to dedicate to computing the scores with joblib. Assigning this parameter to -1 will dedicate as many cores as are available on your system.


This is the API for building ML models. Aliro uses scikit-learn to achieve this.


These methods control data flow from the server to and from sklearn models.

class machine.learn.io_utils.Experiment(args, basedir='.')[source]

Get input data based on experiment ID (_id) from Aliro API.


input_data – Either a pandas.DataFrame, in which case Aliro will use train_test_split to make train/test splits, or a list of two pandas.DataFrame, where the 1st is the training dataset and the 2nd is the testing dataset.

Return type

pandas.DataFrame or list of two pandas.DataFrame


Build scikit learn method based on arguments from Aliro API.


  • model (scikit-learn Estimator) – a machine learning model with scikit-learn API

  • method_type (string) – ‘classification’: classification model; ‘regression’: regression model


Get information on all machine learning algorithms from the Aliro API. This information should be the same as in projects.json.


projects – A dict of all machine learning algorithm’s information

Return type



Parse arguments for machine learning algorithm.


  • args (dict) – Arguments of an experiment from Aliro API

  • param_grid (dict) – Dictionary with parameters names (string) as keys and lists of parameter settings to try as values, or a list of such dictionaries, in which case the grids spanned by each dictionary in the list are explored. This enables searching over any sequence of parameter settings.
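The param_grid format described above (one dictionary of lists, or a list of such dictionaries) can be expanded into individual parameter settings with itertools.product. The following sketch is illustrative only; expand_param_grid is a hypothetical helper, and the parameter names in the example are placeholders.

```python
from itertools import product


def expand_param_grid(param_grid):
    """Yield one {name: value} dict per point in the grid(s)."""
    # Accept either a single grid or a list of grids.
    grids = [param_grid] if isinstance(param_grid, dict) else list(param_grid)
    for grid in grids:
        keys = sorted(grid)
        for values in product(*(grid[k] for k in keys)):
            yield dict(zip(keys, values))


single = list(expand_param_grid({"C": [1, 10], "kernel": ["rbf", "linear"]}))
several = list(expand_param_grid([{"alpha": [0.1, 1.0]}, {"max_depth": [3]}]))
```

A list of dictionaries is explored grid by grid, so parameters from different dictionaries are never combined with each other.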

machine.learn.io_utils.get_input_data(_id, tmpdir)[source]

Get input dataset information from Aliro API.

  • _id (string) – Experiment ID in Aliro

  • tmpdir (string) – Path of temporary directory


  • input_data (pandas.DataFrame or list of two pandas.DataFrame) – Either a pandas.DataFrame, in which case Aliro will use train_test_split to make train/test splits, or a list of two pandas.DataFrame, where the 1st is the training dataset and the 2nd is the testing dataset.

  • data_info (dict) –

    • target_name: string, target column name

    • filename: list, filename(s)

    • categories: list, categorical feature name(s)

    • ordinals: dict

      • keys: categorical feature name(s)

      • values: categorical values


Attempt to retrieve dataset file. If the file is corrupt or an error response is returned, it will raise a ValueError.

  • file_id (string) – File ID from the Aliro database

  • Return (string) – Dataset strings which will be read by pandas and converted to pd.DataFrame

machine.learn.io_utils.check_column(column_name, dataframe)[source]

Check if a column exists in a pandas DataFrame.

  • column_name (string) – column name

  • dataframe (pandas.DataFrame) – input dataset DataFrame

Return type



Convert argument to boolean type.

  • val (string) – Value of a parameter in string type


_ – Converted value in boolean type

Return type



Convert a “none” argument to None.

  • val (string) – Value of a parameter in string type


_ – If the input value is “none”, the function will return None; otherwise it will return the string.

Return type



Return conversion function for input type.


param_type (string or list) – string: type of a parameter which is defined in projects.json; list: list of parameter types (for parameters supporting multiple input types)


known_types[type] – Function for converting argument from Aliro UI for assigning to scikit-learn estimator

Return type
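The three conversion helpers above fit together roughly as shown in this sketch. It is an illustration under assumptions rather than the module's code: the function names, the set of type keys in KNOWN_TYPES, and the try-each-type-in-order behavior for lists are all hypothetical.

```python
def bool_conv(val):
    """Convert the strings 'true'/'false' (any case) to a boolean."""
    lowered = val.strip().lower()
    if lowered == "true":
        return True
    if lowered == "false":
        return False
    raise ValueError(f"not a boolean string: {val!r}")


def none_conv(val):
    """Return None for the string 'none', otherwise the string unchanged."""
    return None if val.strip().lower() == "none" else val


# Hypothetical mapping from declared parameter type to converter.
KNOWN_TYPES = {
    "int": int,
    "float": float,
    "string": str,
    "bool": bool_conv,
    "none": none_conv,
}


def get_type(param_type):
    """Return a converter for one type name, or for a list of type names."""
    if isinstance(param_type, list):
        converters = [KNOWN_TYPES[t] for t in param_type]

        def convert(val):
            # Try each declared type in order; the first that parses wins.
            for conv in converters:
                try:
                    return conv(val)
                except ValueError:
                    continue
            raise ValueError(f"cannot convert {val!r} to any of {param_type}")

        return convert
    return KNOWN_TYPES[param_type]
```

In this sketch the order of a multi-type list matters: the "none" converter accepts any string, so stricter types such as "int" should be listed before it.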


Scikit-learn Utils

These methods generate sklearn models and evaluate them.

machine.learn.skl_utils.balanced_accuracy(y_true, y_pred)[source]

Default scoring function of classification: balanced accuracy. Balanced accuracy computes each class’ accuracy on a per-class basis using a one-vs-rest encoding, then computes an unweighted average of the class accuracies.

  • y_true (numpy.ndarray {n_samples}) – True class labels

  • y_pred (numpy.ndarray {n_samples}) – Predicted class labels by the estimator


fitness – Returns a float value indicating balanced accuracy. 0.5 is as good as chance, and 1.0 is perfect predictive accuracy.

Return type
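One way to read the description above is that each class's one-vs-rest accuracy is the mean of its sensitivity and specificity, which is what makes 0.5 correspond to chance. The sketch below follows that reading with plain Python lists; it is an assumption about the definition, not a copy of the Aliro function, and the handling of classes with no positive or no negative samples is a choice made here.

```python
def balanced_accuracy(y_true, y_pred):
    """Unweighted mean over classes of (sensitivity + specificity) / 2."""
    classes = sorted(set(y_true) | set(y_pred))
    per_class = []
    for c in classes:
        pairs = list(zip(y_true, y_pred))
        pos = sum(t == c for t, _ in pairs)
        neg = len(pairs) - pos
        true_pos = sum(t == c and p == c for t, p in pairs)
        true_neg = sum(t != c and p != c for t, p in pairs)
        # Assumed conventions for degenerate classes.
        sensitivity = true_pos / pos if pos else 0.0
        specificity = true_neg / neg if neg else 1.0
        per_class.append((sensitivity + specificity) / 2)
    return sum(per_class) / len(per_class)


perfect = balanced_accuracy([0, 0, 1, 1], [0, 0, 1, 1])
constant = balanced_accuracy([0, 0, 1, 1], [0, 0, 0, 0])
```

A classifier that always predicts one class scores 0.5 here regardless of class imbalance, whereas plain accuracy would reward it on a skewed dataset.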


machine.learn.skl_utils.generate_results(model, input_data, tmpdir, _id, target_name='class', mode='classification', figure_export=True, random_state=None, filename=['test_dataset'], categories=None, ordinals=None, encoding_strategy='OneHotEncoder', param_grid={})[source]

Generate results for applying a model on a dataset in Aliro.

  • model (scikit-learn Estimator) – A machine learning model following scikit-learn API

  • input_data (pandas.DataFrame or list of two pandas.DataFrame) – Either a pandas.DataFrame, in which case Aliro will use 10-fold CV to estimate train/test scores, or a list of two pandas.DataFrame, where the 1st is the training dataset and the 2nd is the testing dataset.

  • target_name (string) – Target name in input data

  • tmpdir (string) – Temporary directory for saving experiment results

  • _id (string) – Experiment id

  • mode (string) – ‘classification’: Run classification analysis ‘regression’: Run regression analysis

  • figure_export (boolean) – If figure_export is True, the figures will be generated and exported.

  • random_state (int) – Random seed

  • filename (list) – Filename for input dataset

  • categories (list) – List of column names for one-hot encoding

  • ordinals (dict) –

    Dictionary of ordinal features:

    Keys of dictionary: categorical feature name(s) Values of dictionary: categorical values

  • encoding_strategy (string) – Encoding strategy for categorical features defined in projects.json

  • param_grid (dict) – If param_grid is a non-empty dictionary, the experiment will do parameter tuning via GridSearchCV. It should report the best result to the UI and save all results to the knowledgebase.

Return type


machine.learn.skl_utils.get_col_idx(feature_names_list, columns)[source]

Get unique indexes of columns based on a list of column names.

  • feature_names_list (list) – List of column names in the dataset

  • columns (list) – List of selected column names


col_idx – list of selected column indexes

Return type
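A minimal sketch of this lookup, assuming plain Python lists and that the result should be ordered by position in the dataset (the actual ordering used by Aliro is not stated here):

```python
def get_col_idx(feature_names_list, columns):
    """Return the positions in feature_names_list of the selected columns."""
    wanted = set(columns)
    return [i for i, name in enumerate(feature_names_list) if name in wanted]


col_idx = get_col_idx(["age", "sex", "bmi", "bp"], ["bmi", "age"])
```

Such index lists are typically what gets handed to an encoder or column transformer that addresses features by position rather than by name.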


machine.learn.skl_utils.setup_model_params(model, parameter_name, value)[source]

Assign value to a parameter in a model.

  • model (scikit-learn Estimator) – Machine learning model following scikit-learn API

  • parameter_name (string) – Parameter name in the scikit-learn model

  • value (object) – Value to assign to the parameter


model – A new scikit-learn model with an updated parameter

Return type

scikit-learn Estimator

machine.learn.skl_utils.compute_imp_score(model, metric, features, target, random_state)[source]

Compute permutation importance scores for features.

  • tmpdir (string) – Temporary directory for saving experiment results

  • model (scikit-learn Estimator) – A fitted scikit-learn model

  • metric (str, callable) – The metric for evaluating the feature importance through permutation. By default, the string ‘accuracy’ is recommended for classifiers and the string ‘r2’ is recommended for regressors. Optionally, a custom scoring function (e.g., metric=scoring_func) that accepts two arguments, y_true and y_pred, which have similar shape to the y array.

  • features (np.ndarray/pd.DataFrame) – Features in training dataset

  • target (np.ndarray/pd.DataFrame) – Target in training dataset

  • random_state (int) – Random seed for permutation importances


  • coefs (np.ndarray) – Feature importance scores

  • imp_score_type (string) – Importance score type
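Permutation importance measures how much a fitted model's score drops when one feature column is shuffled. The sketch below shows the idea in plain Python with a stand-in predict function; it is not the Aliro implementation, and the function names, the n_repeats parameter, and the toy data are assumptions.

```python
import random


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def permutation_importance(predict, score, X, y, n_repeats=5, random_state=None):
    """Mean drop in score after shuffling each feature column in turn."""
    rng = random.Random(random_state)
    baseline = score(y, predict(X))
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            # Rebuild the rows with only column j permuted.
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            drops.append(baseline - score(y, predict(X_perm)))
        importances.append(sum(drops) / n_repeats)
    return importances


# Toy data: the target equals feature 0; feature 1 is irrelevant.
X = [[i % 2, i] for i in range(20)]
y = [row[0] for row in X]


def predict(rows):
    """Stand-in for a fitted model that only looks at feature 0."""
    return [row[0] for row in rows]


imp = permutation_importance(predict, accuracy, X, y, random_state=0)
```

Because the model is already fitted, no retraining is involved; only predictions are recomputed on the permuted data.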

machine.learn.skl_utils.save_json_fmt(outdir, _id, fname, content)[source]

Save results into json format.

  • outdir (string) – Path of output directory

  • _id (string) – Experiment ID in Aliro

  • fname (string) – File name

  • content (list or dictionary) – Content for results

Return type
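As a rough sketch of what such a helper might do, the code below writes content as JSON to a file inside a per-experiment directory. The directory layout (outdir/_id/fname), the creation of missing directories, and the returned path are assumptions made for illustration, not documented behavior.

```python
import json
import os
import tempfile


def save_json_fmt(outdir, _id, fname, content):
    """Write `content` as JSON to <outdir>/<_id>/<fname> (assumed layout)."""
    expdir = os.path.join(outdir, _id)
    os.makedirs(expdir, exist_ok=True)
    path = os.path.join(expdir, fname)
    with open(path, "w") as outfile:
        json.dump(content, outfile)
    # Returning the path is a convenience of this sketch only.
    return path


with tempfile.TemporaryDirectory() as outdir:
    path = save_json_fmt(outdir, "exp123", "value.json", {"_scored": True})
    with open(path) as infile:
        loaded = json.load(infile)
```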


machine.learn.skl_utils.plot_confusion_matrix(tmpdir, _id, X, y, class_names, cv_scores, figure_export)[source]

Make plot for confusion matrix.

  • tmpdir (string) – Temporary directory for saving experiment results

  • _id (string) – Experiment ID in Aliro

  • X (np.ndarray/pd.DataFrame) – Features in training dataset

  • y (np.ndarray/pd.DataFrame) – Target in training dataset

  • class_names (list) – List of class names

  • cv_scores (dictionary) – Return from sklearn.model_selection.cross_validate

  • figure_export (boolean) – If true, then export the confusion matrix plot

Return type


machine.learn.skl_utils.plot_learning_curve(tmpdir, _id, model, features, target, cv, return_times=True)[source]

Make learning curve.

  • tmpdir (string) – Temporary directory for saving experiment results

  • _id (string) – Experiment ID in Aliro

  • model (user specified model) –

  • features (np.ndarray/pd.DataFrame) – Features in training dataset

  • target (np.ndarray/pd.DataFrame) – Target in training dataset

  • cv (int, cross-validation generator or an iterable) –

Return type


machine.learn.skl_utils.plot_pca_2d(tmpdir, _id, features, target)[source]

Make PCA on 2D.

  • tmpdir (string) – Temporary directory for saving 2d pca plot and json file

  • _id (string) – Experiment ID in Aliro

  • features (np.ndarray/pd.DataFrame) – Features in training dataset

  • target (np.ndarray/pd.DataFrame) – Target in training dataset

Return type


machine.learn.skl_utils.plot_tsne_2d(tmpdir, _id, features, target)[source]

Make tsne on 2D.

  • tmpdir (string) – Temporary directory for saving 2d t-sne plot and json file

  • _id (string) – Experiment ID in Aliro

  • features (np.ndarray/pd.DataFrame) – Features in training dataset

  • target (np.ndarray/pd.DataFrame) – Target in training dataset

Return type


machine.learn.skl_utils.plot_roc_curve(tmpdir, _id, X, y, cv_scores, figure_export)[source]

Plot ROC Curve.

  • tmpdir (string) – Temporary directory for saving experiment results

  • _id (string) – Experiment ID in Aliro

  • X (np.ndarray/pd.DataFrame) – Features in training dataset

  • y (np.ndarray/pd.DataFrame) – Target in training dataset

  • cv_scores (dictionary) – Return from sklearn.model_selection.cross_validate

  • figure_export (boolean) – If true, then export roc curve plot

Return type


machine.learn.skl_utils.plot_imp_score(tmpdir, _id, coefs, feature_names, imp_score_type)[source]

Plot importance scores for features.

  • tmpdir (string) – Temporary directory for saving experiment results

  • _id (string) – Experiment ID in Aliro

  • coefs (array) – Feature importance scores

  • feature_names (np.array) – List of feature names

  • imp_score_type (string) – Importance score type


  • top_features (list) – Top features with high importance score

  • indices (ndarray) – Array of indices of top important features

machine.learn.skl_utils.plot_dot_plot(tmpdir, _id, features, target, top_features_name, indices, random_state, mode)[source]

Make a dot plot based on a decision tree.

  • tmpdir (string) – Temporary directory for saving experiment results

  • _id (string) – Experiment ID in Aliro

  • features (np.ndarray/pd.DataFrame) – Features in training dataset

  • target (np.ndarray/pd.DataFrame) – Target in training dataset

  • top_features_name (list) – Top feature names

  • indices (ndarray) – Array of indices of top important features

  • random_state (int) – Random seed for permutation importances

  • mode (string) – ‘classification’: Run classification analysis; ‘regression’: Run regression analysis


Test score from fitting decision tree on top important feat’

Return type

dtree_train_score, float

machine.learn.skl_utils.export_model(tmpdir, _id, model, filename, target_name, mode='classification', random_state=42)[source]

Export the model as a pickle file and generate a script for using the pickled model.

  • tmpdir (string) – Temporary directory for saving experiment results

  • _id (string) – Experiment ID in Aliro

  • model (scikit-learn estimator) – A fitted scikit-learn model

  • filename (string) – File name of input dataset

  • target_name (string) – Target name in input data

  • mode (string) – ‘classification’: Run classification analysis ‘regression’: Run regression analysis

  • random_state (int) – Random seed in model

Return type


machine.learn.skl_utils.generate_export_codes(pickle_file_name, model, filename, target_name, mode='classification', random_state=42)[source]

Generate all library import calls for use in stand alone python scripts.

  • pickle_file_name (string) – Pickle file name for a fitted scikit-learn estimator

  • model (scikit-learn estimator) – A fitted scikit-learn model

  • filename (string) – File name of input dataset

  • target_name (string) – Target name in input data

  • mode (string) – ‘classification’: Run classification analysis ‘regression’: Run regression analysis

  • random_state (int) – Random seed in model


pipeline_text – The Python scripts for applying the current optimized pipeline in stand-alone python environment

Return type
