AI API¶
AI¶
- class ai.ai.AI(rec_class=None, api_path=None, extra_payload={}, user='testuser', rec_score_file='rec_state.obj', verbose=True, warm_start=False, n_recs=1, datasets=False, use_knowledgebase=False, term_condition='n_recs', max_time=5)[source]¶
AI managing agent for Aliro.
Responsible for:
- checking for user requests for recommendations,
- checking for new results from experiments,
- calling the recommender system to generate experiment recommendations,
- posting the recommendations to the API,
- handling communication with the API.
- Parameters
rec_class – ai.BaseRecommender - recommender to use
api_path – string - path to the lab api server
extra_payload – dict - any additional payload that needs to be specified
user – string - test user
rec_score_file – file - pickled score file to keep persistent scores between sessions
verbose – Boolean
warm_start – Boolean - if true, attempt to load the ai state from the file provided by rec_score_file
n_recs – int - number of recommendations to make for each request
datasets – str or False - if not False, a comma-separated list of datasets for which to turn the AI on at startup
use_knowledgebase – Boolean
- check_requests()[source]¶
Check to see if any new AI requests have been submitted. If so, add them to self.request_queue.
- Returns
Boolean - True if new AI requests have been submitted
- check_results()[source]¶
Checks whether new experiment results have been posted since the previous time step. If so, store them in self.new_data and return True.
- Returns
Boolean - True if new results were found
- generate_recommendations(datasetId, numOfRecs)[source]¶
Generate ML recommendation payloads for the given dataset.
- Parameters
datasetId – ID of the dataset to generate recommendations for
numOfRecs – number of recommendations to generate
- Returns
list of dicts representing request payload objects
- get_results_metafeatures(results_data)[source]¶
Return a pandas dataframe of metafeatures associated with the datasets in results_data.
Retrieves metafeatures from self.dataset_mf_cache if they exist; otherwise queries the API and updates the cache.
- Parameters
results_data – experiment results with associated datasets
- load_state()[source]¶
Loads pickled score file and recommender model.
TODO: test that this still works
- transfer_rec(rec_payload)[source]¶
Attempt to send a recommendation to the lab server. If any error other than a no-capacity error occurs, throw an exception.
- Parameters
rec_payload – dictionary - the payload describing the experiment
- Returns
Boolean - True if successfully sent, False if no machine capacity is available
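A minimal sketch of how these methods might be combined into a polling loop, assuming a reachable lab API server; the server path, recommender choice, poll interval, and the structure of queued requests shown here are illustrative assumptions, not documented defaults:

    import time
    from ai.ai import AI
    from ai.recommender.random_recommender import RandomRecommender

    agent = AI(rec_class=RandomRecommender,
               api_path='http://lab:5080',  # hypothetical server path
               user='testuser',
               n_recs=3)

    while True:
        agent.check_results()          # pull any new experiment results
        if agent.check_requests():     # new AI requests were queued
            # for each requested dataset, build and send payloads;
            # the request fields ('_id') are assumed for illustration
            for req in agent.request_queue:
                for payload in agent.generate_recommendations(req['_id'], 3):
                    agent.transfer_rec(payload)
        time.sleep(5)                  # illustrative poll interval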
Recommenders¶
Base Recommender¶
- class ai.recommender.base.BaseRecommender(ml_type='classifier', metric=None, ml_p=None, random_state=None, knowledgebase_results=None, knowledgebase_metafeatures=None, load_serialized_rec='if_exists', serialized_rec_directory=None, serialized_rec_filename=None)[source]¶
Base recommender for Aliro
The BaseRecommender is not intended to be used directly; it is a skeleton class defining the interface for future recommenders within the Aliro project.
- Parameters
ml_type (str, 'classifier' or 'regressor') – Recommending classifiers or regressors. Used to determine ML options.
metric (str (default: accuracy for classifiers, mse for regressors)) – The metric by which to assess performance on the datasets.
ml_p (DataFrame (default: None)) – Contains all valid ML parameter combos, with columns ‘algorithm’ and ‘parameters’
knowledgebase_results (Pandas DataFrame or None) – Initial knowledgebase results data. If not None and not loading a serialized recommender, the recommender will initialize and train on this data. If loading a serialized recommender, this is the knowledgebase that accompanies it.
knowledgebase_metafeatures (Pandas DataFrame or None) – Initial knowledgebase metafeatures data. If loading a serialized recommender, this is the knowledgebase that accompanies it.
serialized_rec_directory (string or None) – Name of the directory to save/load a serialized recommender. Default directory is “.”
serialized_rec_filename (string or None) – Name of the file to save/load a serialized recommender. If the filename is not provided, a default filename is generated from the recommender type, metric, and knowledgebase used.
load_serialized_rec (str, "always", "never", "if_exists") –
Whether to attempt to load a serialized recommender:
"if_exists" - if a serialized recommender exists at the specified path, load it.
"always" - always load a serialized recommender; throw an exception if no serialized recommender exists.
"never" - never load a serialized recommender.
- load(filename=None, knowledgebase=None)[source]¶
Load a saved recommender state.
- Parameters
filename – string or None Name of file to load
knowledgebase – DataFrame or None - knowledgebase data with columns corresponding to: 'dataset', 'algorithm', 'parameters', self.metric
- recommend(dataset_id=None, n_recs=1, dataset_mf=None)[source]¶
Return a model and parameter values expected to do best on dataset.
- Parameters
dataset_id (string) – ID of the dataset for which the recommender is generating recommendations.
n_recs (int (default: 1), optional) – Return a list of length n_recs in order of estimators and parameters expected to do best.
dataset_mf (DataFrame) – metafeatures of the dataset represented by dataset_id
- save(filename=None)[source]¶
Save the current recommender.
- Parameters
filename – string or None - name of the file to save to
- update(results_data, results_mf=None, source='pennai')[source]¶
Update ML / Parameter recommendations.
- Parameters
results_data (DataFrame) – columns corresponding to: 'algorithm', 'parameters', self.metric
results_mf (DataFrame, optional) – columns corresponding to metafeatures of each dataset in results_data.
source (string) – if ‘pennai’, will update tally of trained dataset models
- update_and_save(results_data, results_mf=None, source='pennai', filename=None)[source]¶
Runs self.update() and self.save().
- Parameters
results_data (DataFrame) – columns corresponding to: 'algorithm', 'parameters', self.metric
results_mf (DataFrame, optional) – columns corresponding to metafeatures of each dataset in results_data.
source (string) – if ‘pennai’, will update tally of trained dataset models
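Since BaseRecommender only defines the interface, new recommenders subclass it and override recommend() and update(). A minimal sketch, assuming the constructor stores ml_p and the metric as attributes (as the parameter list above suggests); the method bodies and return convention are illustrative, so mirror the concrete recommenders below when writing a real one:

    from ai.recommender.base import BaseRecommender

    class FirstRowRecommender(BaseRecommender):
        """Toy recommender: always suggests the first rows of ml_p."""

        def recommend(self, dataset_id=None, n_recs=1, dataset_mf=None):
            # self.ml_p has columns 'algorithm' and 'parameters'
            return self.ml_p.head(n_recs)

        def update(self, results_data, results_mf=None, source='pennai'):
            # results_data has columns 'algorithm', 'parameters', self.metric;
            # a real recommender would update its internal scores here
            pass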
Random Recommender¶
- class ai.recommender.random_recommender.RandomRecommender(ml_type='classifier', metric=None, ml_p=None, random_state=None, knowledgebase_results=None, knowledgebase_metafeatures=None, load_serialized_rec='if_exists', serialized_rec_directory=None, serialized_rec_filename=None)[source]¶
Aliro random recommender.
Recommends random machine learning algorithms and parameters from the possible algorithms fetched from the server.
- Parameters
ml_type (str, 'classifier' or 'regressor') – Recommending classifiers or regressors. Used to determine ML options.
metric (str (default: accuracy for classifiers, mse for regressors)) – The metric by which to assess performance on the datasets.
ml_p (DataFrame) – Contains all the machine learning algorithm and parameter combinations available for recommendation.
- recommend(dataset_id=None, n_recs=1, dataset_mf=None)[source]¶
Return a model and parameter values expected to do best on dataset.
- Parameters
dataset_id (string) – ID of the dataset for which the recommender is generating recommendations.
n_recs (int (default: 1), optional) – Return a list of length n_recs in order of estimators and parameters expected to do best.
- update(results_data, results_mf=None, source='pennai')[source]¶
Update ML / Parameter recommendations.
- Parameters
results_data (DataFrame) – columns corresponding to: 'algorithm', 'parameters', self.metric
results_mf (DataFrame, optional) – columns corresponding to metafeatures of each dataset in results_data.
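A usage sketch of the recommend/update cycle with toy data; real ml_p tables and results come from the Aliro server or a knowledgebase, and the exact result columns may differ:

    import pandas as pd
    from ai.recommender.random_recommender import RandomRecommender

    ml_p = pd.DataFrame({  # toy table of valid algorithm/parameter combos
        'algorithm': ['DecisionTreeClassifier', 'LogisticRegression'],
        'parameters': ["{'max_depth': 3}", "{'C': 1.0}"],
    })
    rec = RandomRecommender(ml_type='classifier', metric='accuracy', ml_p=ml_p)

    recs = rec.recommend(dataset_id='toy_dataset', n_recs=2)

    results = pd.DataFrame({  # feed completed experiments back in
        'dataset': ['toy_dataset'],
        'algorithm': ['DecisionTreeClassifier'],
        'parameters': ["{'max_depth': 3}"],
        'accuracy': [0.83],
    })
    rec.update(results)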
Average Recommender¶
- class ai.recommender.average_recommender.AverageRecommender(ml_type='classifier', metric=None, ml_p=None, random_state=None, knowledgebase_results=None, knowledgebase_metafeatures=None, load_serialized_rec='if_exists', serialized_rec_directory=None, serialized_rec_filename=None)[source]¶
Aliro average recommender.
Recommends machine learning algorithms and parameters based on their average performance across all evaluated datasets.
- Parameters
ml_type (str, 'classifier' or 'regressor') – Recommending classifiers or regressors. Used to determine ML options.
metric (str (default: accuracy for classifiers, mse for regressors)) – The metric by which to assess performance on the datasets.
- recommend(dataset_id=None, n_recs=1, dataset_mf=None)[source]¶
Return a model and parameter values expected to do best on dataset.
- Parameters
dataset_id (string) – ID of the dataset for which the recommender is generating recommendations.
n_recs (int (default: 1), optional) – Return a list of length n_recs in order of estimators and parameters expected to do best.
KNN Recommender¶
- class ai.recommender.knn_meta_recommender.KNNMetaRecommender(ml_type='classifier', metric=None, ml_p=None, random_state=None, knowledgebase_results=None, knowledgebase_metafeatures=None, load_serialized_rec='if_exists', serialized_rec_directory=None, serialized_rec_filename=None)[source]¶
Aliro KNN meta recommender.
- Recommends machine learning algorithms and parameters as follows:
1. store the best ML + P on every dataset.
2. given a new dataset, measure its distance to all results in metafeature space.
3. recommend the ML + P with the best performance on the closest dataset.
- Parameters
ml_type (str, 'classifier' or 'regressor') – Recommending classifiers or regressors. Used to determine ML options.
metric (str (default: accuracy for classifiers, mse for regressors)) – The metric by which to assess performance on the datasets.
ml_p (DataFrame (default: None)) – Contains all valid ML parameter combos, with columns ‘algorithm’ and ‘parameters’
- all_dataset_mf¶
Initialize recommendation system.
- best_model_prediction(dataset_id, df_mf, n_recs=1)[source]¶
Predict scores over many variations of ML+P and pick the best.
- recommend(dataset_id, n_recs=1, dataset_mf=None)[source]¶
Return a model and parameter values expected to do best on dataset.
- Parameters
dataset_id (string) – ID of the dataset for which the recommender is generating recommendations.
n_recs (int (default: 1), optional) – Return a list of length n_recs in order of estimators and parameters expected to do best.
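Because this recommender works in metafeature space, recommend() should be passed the metafeatures of the target dataset. A sketch with toy values; real metafeatures come from the Aliro API or a knowledgebase, and the metafeature columns shown here are made up:

    import pandas as pd
    from ai.recommender.knn_meta_recommender import KNNMetaRecommender

    ml_p = pd.DataFrame({'algorithm': ['DecisionTreeClassifier'],
                         'parameters': ["{'max_depth': 3}"]})
    rec = KNNMetaRecommender(ml_type='classifier', metric='accuracy', ml_p=ml_p)

    # train on prior results plus their dataset metafeatures
    results = pd.DataFrame({'dataset': ['d1'],
                            'algorithm': ['DecisionTreeClassifier'],
                            'parameters': ["{'max_depth': 3}"],
                            'accuracy': [0.9]})
    old_mf = pd.DataFrame([{'n_rows': 100, 'n_features': 5}], index=['d1'])
    rec.update(results, results_mf=old_mf)

    # recommend for a new dataset by its metafeatures
    new_mf = pd.DataFrame([{'n_rows': 150, 'n_features': 4}], index=['d2'])
    recs = rec.recommend(dataset_id='d2', n_recs=1, dataset_mf=new_mf)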
Surprise Recommenders¶
We use a customized version of the Surprise library, available at https://github.com/lacava/surprise.
- class ai.recommender.surprise_recommenders.SurpriseRecommender(ml_type='classifier', metric=None, ml_p=None, random_state=None, knowledgebase_results=None, knowledgebase_metafeatures=None, load_serialized_rec='if_exists', serialized_rec_directory=None, serialized_rec_filename=None)[source]¶
Class to support generic recommenders from the Surprise library. Not intended to be used as a standalone class.
- Parameters
ml_type (str, 'classifier' or 'regressor') – Recommending classifiers or regressors. Used to determine ML options.
metric (str (default: accuracy for classifiers, mse for regressors)) – The metric by which to assess performance on the datasets.
- max_epochs¶
Initialize recommendation system.
- recommend(dataset_id, n_recs=1, dataset_mf=None)[source]¶
Return a model and parameter values expected to do best on dataset.
- Parameters
dataset_id (string) – ID of the dataset for which the recommender is generating recommendations.
n_recs (int (default: 1), optional) – Return a list of length n_recs in order of estimators and parameters expected to do best.
- update(results_data, results_mf=None, source='pennai')[source]¶
Update ML / Parameter recommendations based on overall performance in results_data.
- Parameters
results_data – DataFrame with columns corresponding to: 'dataset', 'algorithm', 'parameters', self.metric
results_mf – metafeatures for the datasets in results_data
- class ai.recommender.surprise_recommenders.CoClusteringRecommender(ml_type='classifier', metric=None, ml_p=None, random_state=None, knowledgebase_results=None, knowledgebase_metafeatures=None, load_serialized_rec='if_exists', serialized_rec_directory=None, serialized_rec_filename=None)[source]¶
Generates recommendations via CoClustering, see https://surprise.readthedocs.io/en/stable/co_clustering.html
- class ai.recommender.surprise_recommenders.KNNWithMeansRecommender(ml_type='classifier', metric=None, ml_p=None, random_state=None, knowledgebase_results=None, knowledgebase_metafeatures=None, load_serialized_rec='if_exists', serialized_rec_directory=None, serialized_rec_filename=None)[source]¶
Generates recommendations via KNNWithMeans, see https://surprise.readthedocs.io/en/stable/knn_inspired.html
- class ai.recommender.surprise_recommenders.KNNDatasetRecommender(ml_type='classifier', metric=None, ml_p=None, random_state=None, knowledgebase_results=None, knowledgebase_metafeatures=None, load_serialized_rec='if_exists', serialized_rec_directory=None, serialized_rec_filename=None)[source]¶
Generates recommendations via KNN with clusters defined over datasets, see https://surprise.readthedocs.io/en/stable/knn_inspired.html
- class ai.recommender.surprise_recommenders.KNNMLRecommender(ml_type='classifier', metric=None, ml_p=None, random_state=None, knowledgebase_results=None, knowledgebase_metafeatures=None, load_serialized_rec='if_exists', serialized_rec_directory=None, serialized_rec_filename=None)[source]¶
Generates recommendations via KNN with clusters defined over algorithms, see https://surprise.readthedocs.io/en/stable/knn_inspired.html
- class ai.recommender.surprise_recommenders.SlopeOneRecommender(ml_type='classifier', metric=None, ml_p=None, random_state=None, knowledgebase_results=None, knowledgebase_metafeatures=None, load_serialized_rec='if_exists', serialized_rec_directory=None, serialized_rec_filename=None)[source]¶
Generates recommendations via SlopeOne, see https://surprise.readthedocs.io/en/stable/slope_one.html
- class ai.recommender.surprise_recommenders.SVDRecommender(ml_type='classifier', metric=None, ml_p=None, random_state=None, knowledgebase_results=None, knowledgebase_metafeatures=None, load_serialized_rec='if_exists', serialized_rec_directory=None, serialized_rec_filename=None)[source]¶
SVD recommender. See https://surprise.readthedocs.io/en/stable/matrix_factorization.html
Recommends machine learning algorithms and parameters using the SVD algorithm:
1. stores ML + P results for every dataset.
2. learns a matrix factorization on the non-missing data.
3. given a dataset, estimates the rankings of all ML + P combinations and returns the top n_recs.
Note that we use a custom online version of SVD found here: https://github.com/lacava/surprise
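A sketch of seeding the SVD recommender from a knowledgebase, per the knowledgebase_results and serialization parameters documented above; the file paths and file formats here are hypothetical:

    import pandas as pd
    from ai.recommender.surprise_recommenders import SVDRecommender

    kb = pd.read_csv('knowledgebase_results.tsv', sep='\t')  # hypothetical path
    mf = pd.read_csv('knowledgebase_metafeatures.csv')       # hypothetical path

    rec = SVDRecommender(ml_type='classifier',
                         metric='accuracy',
                         knowledgebase_results=kb,
                         knowledgebase_metafeatures=mf,
                         load_serialized_rec='if_exists',  # reuse saved state if present
                         serialized_rec_directory='.')
    recs = rec.recommend(dataset_id='new_dataset', n_recs=5)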
Scikit-learn API for Aliro engine¶
This is the API for using Aliro engine as a standalone python package.
- class ai.sklearn.pennai_sklearn.PennAI(rec_class=None, verbose=0, serialized_rec=None, scoring=None, n_recs=10, n_iters=10, knowledgebase=None, kb_metafeatures=None, config_dict=None, ensemble=None, max_time_mins=None, stopping_criteria=None, random_state=None, n_jobs=1)[source]¶
Aliro standalone sklearn wrapper.
Responsible for:
- checking for user requests for recommendations,
- checking for new results from experiments,
- calling the recommender system to generate experiment recommendations,
- posting the recommendations to the API,
- handling communication with the API.
- Parameters
rec_class – ai.BaseRecommender - recommender to use
verbose – int - 0 quiet, 1 info, 2 debug
serialized_rec – string or None - path of the file to save/load a serialized recommender. If the filename is not provided, a default filename is generated from the recommender type, metric, and knowledgebase used.
scoring – str - scoring for evaluating recommendations
n_recs – int - number of recommendations to make for each iteration
n_iters – int - total number of iterations
knowledgebase – file - input file for knowledgebase
kb_metafeatures – input file for metafeatures
config_dict – Python dictionary - hyperparameter search space for all ML algorithms
ensemble – if it is an integer N, Aliro will use VotingClassifier/VotingRegressor to ensemble the top N best models into one model.
max_time_mins – maximum time in minutes that Aliro can run
stopping_criteria – int, optional - number of iterations without improvement in the best metric. Stop recommendations early if the best metric does not improve within this many iterations.
random_state – random state for recommenders
n_jobs – int (default: 1) The number of cores to dedicate to computing the scores with joblib. Assigning this parameter to -1 will dedicate as many cores as are available on your system.
- fit(X, y)[source]¶
Trains Aliro on X,y.
- Parameters
X (array-like {n_samples, n_features}) – Feature matrix of the training set
y (ndarray of shape (n_samples,)) – Target of the training set
- Returns
self
- Return type
object
- predict(X)[source]¶
Predictions for X.
- Parameters
X (array-like {n_samples, n_features}) – Feature matrix of the testing set
- Returns
y – The predicted target.
- Return type
ndarray of shape (n_samples,)
- score(X, y)[source]¶
Return the score on the given testing data using the user-specified scoring function.
- Parameters
X (array-like {n_samples, n_features}) – Feature matrix of the testing set
y (ndarray of shape (n_samples,)) – Target of the testing set
- Returns
accuracy_score – The estimated test set score under the user-specified scoring function
- Return type
float
The two classes below can be imported from pennai.sklearn after installing the pennaipy package via pip (see the User Guide of PennAIpy).
- class ai.sklearn.pennai_sklearn.PennAIClassifier(rec_class=None, verbose=0, serialized_rec=None, scoring=None, n_recs=10, n_iters=10, knowledgebase=None, kb_metafeatures=None, config_dict=None, ensemble=None, max_time_mins=None, stopping_criteria=None, random_state=None, n_jobs=1)[source]¶
Aliro engine for classification tasks.
Read more in the User Guide of PennAIpy.
- Parameters
rec_class (ai.recommender.base.BaseRecommender or None) – Recommender to use in the Aliro engine. If None, Aliro uses SVDRecommender by default.
verbose (int) – 0 quiet, 1 info, 2 debug
serialized_rec (string or None) – Path of the file to save/load a serialized recommender. If the filename is not provided, a default filename is generated from the recommender type, metric, and knowledgebase used.
scoring (str) – scoring for evaluating recommendations. It can be "accuracy", "balanced_accuracy", "f1", or "f1_macro"
n_recs (int) – number of recommendations to make for each iteration
n_iters (int) – total number of iterations
knowledgebase (str) – input file for knowledgebase
kb_metafeatures (str) – input file for metafeature
config_dict (python dictionary) – dictionary for hyperparameter search space for all ML algorithms
ensemble (int) – if it is an integer N, Aliro will use VotingClassifier/VotingRegressor to ensemble the top N best models into one model.
max_time_mins – maximum time in minutes that Aliro can run
stopping_criteria (int) – Number of iterations without improvement in the best metric. Stop recommendations early if the best metric does not improve within this many iterations.
random_state (int) – random state for recommenders
n_jobs (int) – The number of cores to dedicate to computing the scores with joblib. Assigning this parameter to -1 will dedicate as many cores as are available on your system.
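A usage sketch following scikit-learn conventions; the dataset and settings are illustrative:

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from pennai.sklearn import PennAIClassifier

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

    est = PennAIClassifier(n_recs=5, n_iters=10,
                           scoring='balanced_accuracy',
                           random_state=42, n_jobs=1)
    est.fit(X_train, y_train)
    print(est.score(X_test, y_test))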
- class ai.sklearn.pennai_sklearn.PennAIRegressor(rec_class=None, verbose=0, serialized_rec=None, scoring=None, n_recs=10, n_iters=10, knowledgebase=None, kb_metafeatures=None, config_dict=None, ensemble=None, max_time_mins=None, stopping_criteria=None, random_state=None, n_jobs=1)[source]¶
Aliro engine for regression tasks.
Read more in the User Guide of PennAIpy.
- Parameters
rec_class (ai.recommender.base.BaseRecommender or None) – Recommender to use in the Aliro engine. If None, Aliro uses SVDRecommender by default.
verbose (int) – 0 quiet, 1 info, 2 debug
serialized_rec (string or None) – Path of the file to save/load a serialized recommender. If the filename is not provided, a default filename is generated from the recommender type, metric, and knowledgebase used.
scoring (str) – scoring for evaluating recommendations. It can be "r2", "explained_variance", or "neg_mean_squared_error"
n_recs (int) – number of recommendations to make for each iteration
n_iters (int) – total number of iterations
knowledgebase (str) – input file for knowledgebase
kb_metafeatures (str) – input file for metafeature
config_dict (python dictionary) – dictionary for hyperparameter search space for all ML algorithms
ensemble (int) – if it is an integer N, Aliro will use VotingClassifier/VotingRegressor to ensemble the top N best models into one model.
max_time_mins – maximum time in minutes that Aliro can run
stopping_criteria (int) – Number of iterations without improvement in the best metric. Stop recommendations early if the best metric does not improve within this many iterations.
random_state (int) – random state for recommenders
n_jobs (int) – The number of cores to dedicate to computing the scores with joblib. Assigning this parameter to -1 will dedicate as many cores as are available on your system.
Learn¶
This is the API for building ML models. Aliro uses scikit-learn to achieve this.
IO¶
These methods control data flow between the server and sklearn models.
- class machine.learn.io_utils.Experiment(args, basedir='.')[source]¶
- get_input()[source]¶
Get input data based on experiment ID (_id) from Aliro API.
- Returns
input_data – a single pandas.DataFrame (Aliro will use train_test_split to make train/test splits) or a list of two pandas.DataFrame (the first is the training dataset, the second the testing dataset)
- Return type
pandas.Dataframe or list of two pandas.Dataframe
- machine.learn.io_utils.get_projects()[source]¶
Get information about all machine learning algorithms from the Aliro API. This information should match projects.json.
- Returns
projects – A dict with information about all machine learning algorithms
- Return type
dict
- machine.learn.io_utils.parse_args()[source]¶
Parse arguments for machine learning algorithm.
- Returns
args (dict) – Arguments of an experiment from the Aliro API
param_grid (dict) – Dictionary with parameters names (string) as keys and lists of parameter settings to try as values, or a list of such dictionaries, in which case the grids spanned by each dictionary in the list are explored. This enables searching over any sequence of parameter settings.
- machine.learn.io_utils.get_input_data(_id, tmpdir)[source]¶
Get input dataset information from Aliro API.
- Parameters
_id (string) – Experiment ID in Aliro
tmpdir (string) – Path of temporary directory
- Returns
input_data (pandas.Dataframe or list of two pandas.Dataframe) – a single pandas.DataFrame (Aliro will use train_test_split to make train/test splits) or a list of two pandas.DataFrame (the first is the training dataset, the second the testing dataset)
data_info (dict) –
target_name: string, target column name
filename: list, filename(s)
categories: list, categorical feature name(s)
ordinals: dict
keys: categorical feature name(s)
values: categorical values
- machine.learn.io_utils.get_file_data(file_id)[source]¶
Attempt to retrieve the dataset file. If the file is corrupt or an error response is returned, raise a ValueError.
- Parameters
file_id (string) – File ID from the Aliro database
- Returns
Dataset string which will be read by pandas and converted to a pd.DataFrame
- Return type
string
- machine.learn.io_utils.check_column(column_name, dataframe)[source]¶
Check if a column exists in a pandas DataFrame.
- Parameters
column_name (string) – column name
dataframe (pandas.DataFrame) – input dataset DataFrame
- Return type
None
- machine.learn.io_utils.bool_type(val)[source]¶
Convert argument to boolean type.
- Parameters
val (string) – Value of a parameter in string type
- Returns
_ – Converted value in boolean type
- Return type
boolean
- machine.learn.io_utils.none(val)[source]¶
Convert a "none" argument to None.
- Parameters
val (string) – Value of a parameter in string type
- Returns
_ – If the input value is "none", the function returns None; otherwise it returns the string unchanged.
- Return type
None or string
- machine.learn.io_utils.get_type(param_type)[source]¶
Return the conversion function for an input type.
- Parameters
param_type (string or list) – string: type of a parameter as defined in projects.json; list: list of parameter types (for parameters supporting multiple input types)
- Returns
known_types[type] – Function for converting an argument from the Aliro UI before assigning it to a scikit-learn estimator
- Return type
function
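A sketch of how these converters behave on UI strings; the exact set of accepted strings and type keys is an assumption based on the descriptions above:

    from machine.learn.io_utils import bool_type, none, get_type

    bool_type('True')   # -> True: UI string converted to a Python bool
    none('none')        # -> None
    none('auto')        # -> 'auto': non-'none' strings pass through unchanged

    to_int = get_type('int')   # assumed type key; returns a conversion function
    to_int('10')               # -> 10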
Scikit-learn Utils¶
These methods generate sklearn models and evaluate them.
- machine.learn.skl_utils.balanced_accuracy(y_true, y_pred)[source]¶
Default scoring function of classification: balanced accuracy. Balanced accuracy computes each class’ accuracy on a per-class basis using a one-vs-rest encoding, then computes an unweighted average of the class accuracies.
- Parameters
y_true (numpy.ndarray {n_samples}) – True class labels
y_pred (numpy.ndarray {n_samples}) – Predicted class labels by the estimator
- Returns
fitness – A float value indicating balanced accuracy: 0.5 is as good as chance, and 1.0 is perfect predictive accuracy
- Return type
float
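A worked sketch of the definition above: for each class, treat the problem as one-vs-rest, average that class's sensitivity and specificity, then take the unweighted mean over classes. This illustrates the metric and is not necessarily the shipped implementation:

    import numpy as np

    def balanced_accuracy_sketch(y_true, y_pred):
        y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
        per_class = []
        for c in np.unique(y_true):
            sensitivity = np.mean(y_pred[y_true == c] == c)  # true positive rate
            specificity = np.mean(y_pred[y_true != c] != c)  # true negative rate
            per_class.append((sensitivity + specificity) / 2.0)
        return np.mean(per_class)

    balanced_accuracy_sketch([0, 0, 1, 1], [0, 1, 1, 1])  # -> 0.75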
- machine.learn.skl_utils.generate_results(model, input_data, tmpdir, _id, target_name='class', mode='classification', figure_export=True, random_state=None, filename=['test_dataset'], categories=None, ordinals=None, encoding_strategy='OneHotEncoder', param_grid={})[source]¶
Generate results for applying a model to a dataset in Aliro.
- Parameters
model (scikit-learn Estimator) – A machine learning model following scikit-learn API
input_data (pandas.Dataframe or list of two pandas.Dataframe) – a single pandas.DataFrame (Aliro will use 10-fold CV to estimate train/test scores) or a list of two pandas.DataFrame (the first is the training dataset, the second the testing dataset)
target_name (string) – Target name in input data
tmpdir (string) – Temporary directory for saving experiment results
_id (string) – Experiment id
mode (string) – ‘classification’: Run classification analysis ‘regression’: Run regression analysis
figure_export (boolean) – If figure_export is True, the figures will be generated and exported.
random_state (int) – Random seed
filename (list) – Filename for input dataset
categories (list) – List of column names for one-hot encoding
ordinals (dict) – Dictionary of ordinal features:
keys: categorical feature name(s)
values: categorical values
encoding_strategy (string) – Encoding strategy for categorical features defined in projects.json
param_grid (dict) – If param_grid is a non-empty dictionary, the experiment will do parameter tuning via GridSearchCV, report the best result to the UI, and save all results to the knowledge base.
- Return type
None
- machine.learn.skl_utils.get_col_idx(feature_names_list, columns)[source]¶
Get unique indexes of columns based on a list of column names.
- Parameters
feature_names_list (list) – List of column names in the dataset
columns (list) – List of selected column names
- Returns
col_idx – list of selected column indexes
- Return type
list
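For example (toy inputs; the assumption here is that indexes come back in dataset-column order):

    from machine.learn.skl_utils import get_col_idx

    get_col_idx(['sepal_len', 'sepal_wid', 'petal_len'],
                ['petal_len', 'sepal_wid'])
    # -> [1, 2], the unique indexes of the selected columns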
- machine.learn.skl_utils.setup_model_params(model, parameter_name, value)[source]¶
Assign a value to a parameter in a model.
- Parameters
model (scikit-learn Estimator) – Machine learning model following the scikit-learn API
parameter_name (string) – Parameter name in the scikit-learn model
value (object) – Value to assign to the parameter
- Returns
model – A new scikit-learn model with an updated parameter
- Return type
scikit-learn Estimator
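For example, pinning a seed on an estimator (a toy call, similar in spirit to scikit-learn's own set_params):

    from sklearn.tree import DecisionTreeClassifier
    from machine.learn.skl_utils import setup_model_params

    model = DecisionTreeClassifier()
    model = setup_model_params(model, 'random_state', 42)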
- machine.learn.skl_utils.compute_imp_score(model, metric, features, target, random_state)[source]¶
Compute permutation importance scores for features.
- Parameters
model (scikit-learn Estimator) – A fitted scikit-learn model
metric (str or callable) – The metric for evaluating feature importance through permutation. By default, the string 'accuracy' is recommended for classifiers and the string 'r2' for regressors. Optionally, a custom scoring function (e.g., metric=scoring_func) that accepts two arguments, y_true and y_pred, which have shapes similar to the y array.
features (np.darray/pd.DataFrame) – Features in training dataset
target (np.darray/pd.DataFrame) – Target in training dataset
random_state (int) – Random seed for permutation importances
- Returns
coefs (np.darray) – Feature importance scores
imp_score_type (string) – Importance score type
- machine.learn.skl_utils.save_json_fmt(outdir, _id, fname, content)[source]¶
Save results into json format.
- Parameters
outdir (string) – Path of output directory
_id (string) – Experiment ID in Aliro
fname (string) – File name
content (list or dict) – Content of the results
- Return type
None
- machine.learn.skl_utils.plot_confusion_matrix(tmpdir, _id, X, y, class_names, cv_scores, figure_export)[source]¶
Make plot for confusion matrix.
- Parameters
tmpdir (string) – Temporary directory for saving experiment results
_id (string) – Experiment ID in Aliro
X (np.darray/pd.DataFrame) – Features in training dataset
y (np.darray/pd.DataFrame) – Target in training dataset
class_names (list) – List of class names
cv_scores (dictionary) – Return from sklearn.model_selection.cross_validate
figure_export (boolean) – If true, then export the confusion matrix plot
- Return type
None
- machine.learn.skl_utils.plot_learning_curve(tmpdir, _id, model, features, target, cv, return_times=True)[source]¶
Make learning curve.
- Parameters
tmpdir (string) – Temporary directory for saving experiment results
_id (string) – Experiment ID in Aliro
model (user specified model) –
features (np.darray/pd.DataFrame) – Features in training dataset
target (np.darray/pd.DataFrame) – Target in training dataset
cv (int, cross-validation generator or an iterable) –
- Return type
None
- machine.learn.skl_utils.plot_pca_2d(tmpdir, _id, features, target)[source]¶
Make a 2D PCA plot.
- Parameters
tmpdir (string) – Temporary directory for saving 2d pca plot and json file
_id (string) – Experiment ID in Aliro
features (np.darray/pd.DataFrame) – Features in training dataset
target (np.darray/pd.DataFrame) – Target in training dataset
- Return type
None
- machine.learn.skl_utils.plot_tsne_2d(tmpdir, _id, features, target)[source]¶
Make a 2D t-SNE plot.
- Parameters
tmpdir (string) – Temporary directory for saving 2d t-sne plot and json file
_id (string) – Experiment ID in Aliro
features (np.darray/pd.DataFrame) – Features in training dataset
target (np.darray/pd.DataFrame) – Target in training dataset
- Return type
None
- machine.learn.skl_utils.plot_roc_curve(tmpdir, _id, X, y, cv_scores, figure_export)[source]¶
Plot ROC curve.
- Parameters
tmpdir (string) – Temporary directory for saving experiment results
_id (string) – Experiment ID in Aliro
X (np.darray/pd.DataFrame) – Features in training dataset
y (np.darray/pd.DataFrame) – Target in training dataset
cv_scores (dictionary) – Return from sklearn.model_selection.cross_validate
figure_export (boolean) – If true, then export the ROC curve plot
- Return type
None
- machine.learn.skl_utils.plot_imp_score(tmpdir, _id, coefs, feature_names, imp_score_type)[source]¶
Plot importance scores for features.
- Parameters
tmpdir (string) – Temporary directory for saving experiment results
_id (string) – Experiment ID in Aliro
coefs (array) – Feature importance scores
feature_names (np.array) – List of feature names
imp_score_type (string) – Importance score type
- Returns
top_features (list) – Top features with high importance score
indices (ndarray) – Array of indices of top important features
- machine.learn.skl_utils.plot_dot_plot(tmpdir, _id, features, target, top_features_name, indices, random_state, mode)[source]¶
Make a dot plot based on a decision tree.
- Parameters
tmpdir (string) – Temporary directory for saving experiment results
_id (string) – Experiment ID in Aliro
features (np.darray/pd.DataFrame) – Features in training dataset
target (np.darray/pd.DataFrame) – Target in training dataset
top_features_name (list) – Names of top features
indices (ndarray) – Array of indices of top important features
random_state (int) – Random seed for permutation importances
mode (string) – ‘classification’: Run classification analysis ‘regression’: Run regression analysis
- Returns
dtree_train_score – Test score from fitting the decision tree on the top important features
- Return type
float
- machine.learn.skl_utils.export_model(tmpdir, _id, model, filename, target_name, mode='classification', random_state=42)[source]¶
Export the model as a pickle file and generate a script for using the pickled model.
- Parameters
tmpdir (string) – Temporary directory for saving experiment results
_id (string) – Experiment ID in Aliro
model (scikit-learn estimator) – A fitted scikit-learn model
filename (string) – File name of input dataset
target_name (string) – Target name in input data
mode (string) – ‘classification’: Run classification analysis ‘regression’: Run regression analysis
random_state (int) – Random seed in model
- Return type
None
- machine.learn.skl_utils.generate_export_codes(pickle_file_name, model, filename, target_name, mode='classification', random_state=42)[source]¶
Generate all library import calls for use in stand-alone Python scripts.
- Parameters
pickle_file_name (string) – Pickle file name for a fitted scikit-learn estimator
model (scikit-learn estimator) – A fitted scikit-learn model
filename (string) – File name of input dataset
target_name (string) – Target name in input data
mode (string) – ‘classification’: Run classification analysis ‘regression’: Run regression analysis
random_state (int) – Random seed in model
- Returns
pipeline_text – The Python script for applying the current optimized pipeline in a stand-alone Python environment
- Return type
String
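The exported pickle is meant to be reusable outside Aliro. A hedged sketch of what a stand-alone script applying it might look like; the file names, the choice of pickle over joblib, and the target column are assumptions, and the generated script itself is authoritative:

    import pickle
    import pandas as pd

    with open('model_12345.pkl', 'rb') as f:  # hypothetical pickle file name
        model = pickle.load(f)

    df = pd.read_csv('test_dataset.tsv', sep='\t')  # hypothetical dataset file
    X = df.drop(columns=['class'])                  # 'class' = target_name
    print(model.predict(X))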