AI API¶
AI¶
- class ai.ai.AI(rec_class=None, api_path=None, extra_payload={}, user='testuser', rec_score_file='rec_state.obj', verbose=True, warm_start=False, n_recs=1, datasets=False, use_knowledgebase=False, term_condition='n_recs', max_time=5)[source]¶
AI managing agent for Aliro.
Responsible for:
- checking for user requests for recommendations,
- checking for new results from experiments,
- calling the recommender system to generate experiment recommendations,
- posting the recommendations to the API,
- handling communication with the API.
- Parameters
rec_class – ai.BaseRecommender - recommender to use
api_path – string - path to the lab api server
extra_payload – dict - any additional payload that needs to be specified
user – string - test user
rec_score_file – file - pickled score file to keep persistent scores between sessions
verbose – Boolean
warm_start – Boolean - if true, attempt to load the ai state from the file provided by rec_score_file
n_recs – int - number of recommendations to make for each request
datasets – str or False - if not False, a comma-separated list of datasets for which to turn the AI on at startup
use_knowledgebase – Boolean
- check_requests()[source]¶
Check to see if any new AI requests have been submitted. If so, add them to self.request_queue.
- Returns
Boolean - True if new AI requests have been submitted
- check_results()[source]¶
Checks whether new experiment results have been posted since the previous time step. If so, store them in self.new_data and return True.
- Returns
Boolean - True if new results were found
- generate_recommendations(datasetId, numOfRecs)[source]¶
Generate ML recommendation payloads for the given dataset.
- Parameters
datasetId – ID of the dataset to generate recommendations for
numOfRecs – number of recommendations to generate
- Returns
list of dicts representing request payload objects
- get_results_metafeatures(results_data)[source]¶
Return a pandas dataframe of metafeatures associated with the datasets in results_data.
Retrieves metafeatures from self.dataset_mf_cache if they exist; otherwise queries the API and updates the cache.
- Parameters
results_data – experiment results with associated datasets
- load_state()[source]¶
Loads pickled score file and recommender model.
TODO: test that this still works
- transfer_rec(rec_payload)[source]¶
Attempt to send a recommendation to the lab server. If any error other than a no-capacity error occurs, throw an exception.
- Parameters
rec_payload – dictionary - the payload describing the experiment
- Returns
Boolean - True if successfully sent, False if no machine capacity is available
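A minimal sketch of how these methods might be combined into a polling loop, assuming a reachable lab API server; the server path, recommender choice, poll interval, and the structure of queued requests shown here are illustrative assumptions, not documented defaults:

    import time
    from ai.ai import AI
    from ai.recommender.random_recommender import RandomRecommender

    agent = AI(rec_class=RandomRecommender,
               api_path='http://lab:5080',  # hypothetical server path
               user='testuser',
               n_recs=3)

    while True:
        agent.check_results()          # pull any new experiment results
        if agent.check_requests():     # new AI requests were queued
            # for each requested dataset, build and send payloads;
            # the request fields ('_id') are assumed for illustration
            for req in agent.request_queue:
                for payload in agent.generate_recommendations(req['_id'], 3):
                    agent.transfer_rec(payload)
        time.sleep(5)                  # illustrative poll interval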
Recommenders¶
Base Recommender¶
- class ai.recommender.base.BaseRecommender(ml_type='classifier', metric=None, ml_p=None, random_state=None, knowledgebase_results=None, knowledgebase_metafeatures=None, load_serialized_rec='if_exists', serialized_rec_directory=None, serialized_rec_filename=None)[source]¶
Base recommender for Aliro
The BaseRecommender is not intended to be used directly; it is a skeleton class defining the interface for future recommenders within the Aliro project.
- Parameters
ml_type (str, 'classifier' or 'regressor') – Recommending classifiers or regressors. Used to determine ML options.
metric (str (default: accuracy for classifiers, mse for regressors)) – The metric by which to assess performance on the datasets.
ml_p (DataFrame (default: None)) – Contains all valid ML parameter combos, with columns ‘algorithm’ and ‘parameters’
knowledgebase_results (Pandas DataFrame or None) – Initial knowledgebase results data. If not None and not loading a serialized recommender, the recommender will initialize and train on this data. If loading a serialized recommender, this is the knowledgebase that accompanies it.
knowledgebase_metafeatures (Pandas DataFrame or None) – Initial knowledgebase metafeatures data. If loading a serialized recommender, this is the knowledgebase that accompanies it.
serialized_rec_directory (string or None) – Name of the directory to save/load a serialized recommender. Default directory is “.”
serialized_rec_filename (string or None) – Name of the file to save/load a serialized recommender. If the filename is not provided, a default filename is generated from the recommender type, metric, and knowledgebase used.
load_serialized_rec (str, "always", "never", "if_exists") –
Whether to attempt to load a serialized recommender:
"if_exists" - if a serialized recommender exists at the specified path, load it.
"always" - always load a serialized recommender; throw an exception if no serialized recommender exists.
"never" - never load a serialized recommender.
- load(filename=None, knowledgebase=None)[source]¶
Load a saved recommender state.
- Parameters
filename – string or None Name of file to load
knowledgebase – DataFrame or None - knowledgebase data with columns corresponding to: 'dataset', 'algorithm', 'parameters', self.metric
- recommend(dataset_id=None, n_recs=1, dataset_mf=None)[source]¶
Return a model and parameter values expected to do best on dataset.
- Parameters
dataset_id (string) – ID of the dataset for which the recommender is generating recommendations.
n_recs (int (default: 1), optional) – Return a list of length n_recs in order of estimators and parameters expected to do best.
dataset_mf (DataFrame) – metafeatures of the dataset represented by dataset_id
- save(filename=None)[source]¶
Save the current recommender.
- Parameters
filename – string or None - name of the file to save to
- update(results_data, results_mf=None, source='pennai')[source]¶
Update ML / Parameter recommendations.
- Parameters
results_data (DataFrame) – columns corresponding to: 'algorithm', 'parameters', self.metric
results_mf (DataFrame, optional) – columns corresponding to metafeatures of each dataset in results_data.
source (string) – if ‘pennai’, will update tally of trained dataset models
- update_and_save(results_data, results_mf=None, source='pennai', filename=None)[source]¶
Runs self.update() and self.save().
- Parameters
results_data (DataFrame) – columns corresponding to: 'algorithm', 'parameters', self.metric
results_mf (DataFrame, optional) – columns corresponding to metafeatures of each dataset in results_data.
source (string) – if ‘pennai’, will update tally of trained dataset models
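Since BaseRecommender only defines the interface, new recommenders subclass it and override recommend() and update(). A minimal sketch, assuming the constructor stores ml_p and the metric as attributes (as the parameter list above suggests); the method bodies and return convention are illustrative, so mirror the concrete recommenders below when writing a real one:

    from ai.recommender.base import BaseRecommender

    class FirstRowRecommender(BaseRecommender):
        """Toy recommender: always suggests the first rows of ml_p."""

        def recommend(self, dataset_id=None, n_recs=1, dataset_mf=None):
            # self.ml_p has columns 'algorithm' and 'parameters'
            return self.ml_p.head(n_recs)

        def update(self, results_data, results_mf=None, source='pennai'):
            # results_data has columns 'algorithm', 'parameters', self.metric;
            # a real recommender would update its internal scores here
            pass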
Random Recommender¶
- class ai.recommender.random_recommender.RandomRecommender(ml_type='classifier', metric=None, ml_p=None, random_state=None, knowledgebase_results=None, knowledgebase_metafeatures=None, load_serialized_rec='if_exists', serialized_rec_directory=None, serialized_rec_filename=None)[source]¶
Aliro random recommender.
Recommends random machine learning algorithms and parameters from the possible algorithms fetched from the server.
- Parameters
ml_type (str, 'classifier' or 'regressor') – Recommending classifiers or regressors. Used to determine ML options.
metric (str (default: accuracy for classifiers, mse for regressors)) – The metric by which to assess performance on the datasets.
ml_p (DataFrame) – Contains all the machine learning algorithm and parameter combinations available for recommendation.
- recommend(dataset_id=None, n_recs=1, dataset_mf=None)[source]¶
Return a model and parameter values expected to do best on dataset.
- Parameters
dataset_id (string) – ID of the dataset for which the recommender is generating recommendations.
n_recs (int (default: 1), optional) – Return a list of length n_recs in order of estimators and parameters expected to do best.
- update(results_data, results_mf=None, source='pennai')[source]¶
Update ML / Parameter recommendations.
- Parameters
results_data (DataFrame) – columns corresponding to: 'algorithm', 'parameters', self.metric
results_mf (DataFrame, optional) – columns corresponding to metafeatures of each dataset in results_data.
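A usage sketch of the recommend/update cycle with toy data; real ml_p tables and results come from the Aliro server or a knowledgebase, and the exact result columns may differ:

    import pandas as pd
    from ai.recommender.random_recommender import RandomRecommender

    ml_p = pd.DataFrame({  # toy table of valid algorithm/parameter combos
        'algorithm': ['DecisionTreeClassifier', 'LogisticRegression'],
        'parameters': ["{'max_depth': 3}", "{'C': 1.0}"],
    })
    rec = RandomRecommender(ml_type='classifier', metric='accuracy', ml_p=ml_p)

    recs = rec.recommend(dataset_id='toy_dataset', n_recs=2)

    results = pd.DataFrame({  # feed completed experiments back in
        'dataset': ['toy_dataset'],
        'algorithm': ['DecisionTreeClassifier'],
        'parameters': ["{'max_depth': 3}"],
        'accuracy': [0.83],
    })
    rec.update(results)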
Average Recommender¶
- class ai.recommender.average_recommender.AverageRecommender(ml_type='classifier', metric=None, ml_p=None, random_state=None, knowledgebase_results=None, knowledgebase_metafeatures=None, load_serialized_rec='if_exists', serialized_rec_directory=None, serialized_rec_filename=None)[source]¶
Aliro average recommender.
Recommends machine learning algorithms and parameters based on their average performance across all evaluated datasets.
- Parameters
ml_type (str, 'classifier' or 'regressor') – Recommending classifiers or regressors. Used to determine ML options.
metric (str (default: accuracy for classifiers, mse for regressors)) – The metric by which to assess performance on the datasets.
- recommend(dataset_id=None, n_recs=1, dataset_mf=None)[source]¶
Return a model and parameter values expected to do best on dataset.
- Parameters
dataset_id (string) – ID of the dataset for which the recommender is generating recommendations.
n_recs (int (default: 1), optional) – Return a list of length n_recs in order of estimators and parameters expected to do best.
KNN Recommender¶
- class ai.recommender.knn_meta_recommender.KNNMetaRecommender(ml_type='classifier', metric=None, ml_p=None, random_state=None, knowledgebase_results=None, knowledgebase_metafeatures=None, load_serialized_rec='if_exists', serialized_rec_directory=None, serialized_rec_filename=None)[source]¶
Aliro KNN meta recommender.
- Recommends machine learning algorithms and parameters as follows:
1. store the best ML + P on every dataset.
2. given a new dataset, measure its distance to all results in metafeature space.
3. recommend the ML + P with the best performance on the closest dataset.
- Parameters
ml_type (str, 'classifier' or 'regressor') – Recommending classifiers or regressors. Used to determine ML options.
metric (str (default: accuracy for classifiers, mse for regressors)) – The metric by which to assess performance on the datasets.
ml_p (DataFrame (default: None)) – Contains all valid ML parameter combos, with columns ‘algorithm’ and ‘parameters’
- all_dataset_mf¶
Initialize recommendation system.
- best_model_prediction(dataset_id, df_mf, n_recs=1)[source]¶
Predict scores over many variations of ML+P and pick the best.
- recommend(dataset_id, n_recs=1, dataset_mf=None)[source]¶
Return a model and parameter values expected to do best on dataset.
- Parameters
dataset_id (string) – ID of the dataset for which the recommender is generating recommendations.
n_recs (int (default: 1), optional) – Return a list of length n_recs in order of estimators and parameters expected to do best.
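Because this recommender works in metafeature space, recommend() should be passed the metafeatures of the target dataset. A sketch with toy values; real metafeatures come from the Aliro API or a knowledgebase, and the metafeature columns shown here are made up:

    import pandas as pd
    from ai.recommender.knn_meta_recommender import KNNMetaRecommender

    ml_p = pd.DataFrame({'algorithm': ['DecisionTreeClassifier'],
                         'parameters': ["{'max_depth': 3}"]})
    rec = KNNMetaRecommender(ml_type='classifier', metric='accuracy', ml_p=ml_p)

    # train on prior results plus their dataset metafeatures
    results = pd.DataFrame({'dataset': ['d1'],
                            'algorithm': ['DecisionTreeClassifier'],
                            'parameters': ["{'max_depth': 3}"],
                            'accuracy': [0.9]})
    old_mf = pd.DataFrame([{'n_rows': 100, 'n_features': 5}], index=['d1'])
    rec.update(results, results_mf=old_mf)

    # recommend for a new dataset by its metafeatures
    new_mf = pd.DataFrame([{'n_rows': 150, 'n_features': 4}], index=['d2'])
    recs = rec.recommend(dataset_id='d2', n_recs=1, dataset_mf=new_mf)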
Surprise Recommenders¶
We use a customized version of the Surprise library, available at https://github.com/lacava/surprise.
- class ai.recommender.surprise_recommenders.SurpriseRecommender(ml_type='classifier', metric=None, ml_p=None, random_state=None, knowledgebase_results=None, knowledgebase_metafeatures=None, load_serialized_rec='if_exists', serialized_rec_directory=None, serialized_rec_filename=None)[source]¶
Class to support generic recommenders from the Surprise library. Not intended to be used as a standalone class.
- Parameters
ml_type (str, 'classifier' or 'regressor') – Recommending classifiers or regressors. Used to determine ML options.
metric (str (default: accuracy for classifiers, mse for regressors)) – The metric by which to assess performance on the datasets.
- max_epochs¶
Initialize recommendation system.
- recommend(dataset_id, n_recs=1, dataset_mf=None)[source]¶
Return a model and parameter values expected to do best on dataset.
- Parameters
dataset_id (string) – ID of the dataset for which the recommender is generating recommendations.
n_recs (int (default: 1), optional) – Return a list of length n_recs in order of estimators and parameters expected to do best.
- update(results_data, results_mf=None, source='pennai')[source]¶
Update ML / Parameter recommendations based on overall performance in results_data.
- Parameters
results_data – DataFrame with columns corresponding to: 'dataset', 'algorithm', 'parameters', self.metric
results_mf – metafeatures for the datasets in results_data
- class ai.recommender.surprise_recommenders.CoClusteringRecommender(ml_type='classifier', metric=None, ml_p=None, random_state=None, knowledgebase_results=None, knowledgebase_metafeatures=None, load_serialized_rec='if_exists', serialized_rec_directory=None, serialized_rec_filename=None)[source]¶
Generates recommendations via CoClustering, see https://surprise.readthedocs.io/en/stable/co_clustering.html
- class ai.recommender.surprise_recommenders.KNNWithMeansRecommender(ml_type='classifier', metric=None, ml_p=None, random_state=None, knowledgebase_results=None, knowledgebase_metafeatures=None, load_serialized_rec='if_exists', serialized_rec_directory=None, serialized_rec_filename=None)[source]¶
Generates recommendations via KNNWithMeans, see https://surprise.readthedocs.io/en/stable/knn_inspired.html
- class ai.recommender.surprise_recommenders.KNNDatasetRecommender(ml_type='classifier', metric=None, ml_p=None, random_state=None, knowledgebase_results=None, knowledgebase_metafeatures=None, load_serialized_rec='if_exists', serialized_rec_directory=None, serialized_rec_filename=None)[source]¶
Generates recommendations via KNN with clusters defined over datasets, see https://surprise.readthedocs.io/en/stable/knn_inspired.html
- class ai.recommender.surprise_recommenders.KNNMLRecommender(ml_type='classifier', metric=None, ml_p=None, random_state=None, knowledgebase_results=None, knowledgebase_metafeatures=None, load_serialized_rec='if_exists', serialized_rec_directory=None, serialized_rec_filename=None)[source]¶
Generates recommendations via KNN with clusters defined over algorithms, see https://surprise.readthedocs.io/en/stable/knn_inspired.html
- class ai.recommender.surprise_recommenders.SlopeOneRecommender(ml_type='classifier', metric=None, ml_p=None, random_state=None, knowledgebase_results=None, knowledgebase_metafeatures=None, load_serialized_rec='if_exists', serialized_rec_directory=None, serialized_rec_filename=None)[source]¶
Generates recommendations via SlopeOne, see https://surprise.readthedocs.io/en/stable/slope_one.html
- class ai.recommender.surprise_recommenders.SVDRecommender(ml_type='classifier', metric=None, ml_p=None, random_state=None, knowledgebase_results=None, knowledgebase_metafeatures=None, load_serialized_rec='if_exists', serialized_rec_directory=None, serialized_rec_filename=None)[source]¶
SVD recommender. See https://surprise.readthedocs.io/en/stable/matrix_factorization.html
Recommends machine learning algorithms and parameters using the SVD algorithm:
1. stores ML + P results for every dataset.
2. learns a matrix factorization on the non-missing data.
3. given a dataset, estimates the rankings of all ML + P combinations and returns the top n_recs.
Note that we use a custom online version of SVD found here: https://github.com/lacava/surprise
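A sketch of seeding the SVD recommender from a knowledgebase, per the knowledgebase_results and serialization parameters documented above; the file paths and file formats here are hypothetical:

    import pandas as pd
    from ai.recommender.surprise_recommenders import SVDRecommender

    kb = pd.read_csv('knowledgebase_results.tsv', sep='\t')  # hypothetical path
    mf = pd.read_csv('knowledgebase_metafeatures.csv')       # hypothetical path

    rec = SVDRecommender(ml_type='classifier',
                         metric='accuracy',
                         knowledgebase_results=kb,
                         knowledgebase_metafeatures=mf,
                         load_serialized_rec='if_exists',  # reuse saved state if present
                         serialized_rec_directory='.')
    recs = rec.recommend(dataset_id='new_dataset', n_recs=5)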
Scikit-learn API for Aliro engine¶
This is the API for using Aliro engine as a standalone python package.
- class ai.sklearn.pennai_sklearn.PennAI(rec_class=None, verbose=0, serialized_rec=None, scoring=None, n_recs=10, n_iters=10, knowledgebase=None, kb_metafeatures=None, config_dict=None, ensemble=None, max_time_mins=None, stopping_criteria=None, random_state=None, n_jobs=1)[source]¶
Aliro standalone sklearn wrapper.
Responsible for:
- checking for user requests for recommendations,
- checking for new results from experiments,
- calling the recommender system to generate experiment recommendations,
- posting the recommendations to the API,
- handling communication with the API.
- Parameters
rec_class – ai.BaseRecommender - recommender to use
verbose – int - 0 quiet, 1 info, 2 debug
serialized_rec – string or None - path of the file to save/load a serialized recommender. If the filename is not provided, a default filename is generated from the recommender type, metric, and knowledgebase used.
scoring – str - scoring for evaluating recommendations
n_recs – int - number of recommendations to make for each iteration
n_iters – int - total number of iterations
knowledgebase – file - input file for knowledgebase
kb_metafeatures – input file for metafeatures
config_dict – Python dictionary - hyperparameter search space for all ML algorithms
ensemble – if it is an integer N, Aliro will use VotingClassifier/VotingRegressor to ensemble the top N best models into one model.
max_time_mins – maximum time in minutes that Aliro can run
stopping_criteria – int, optional - number of iterations without improvement in the best metric. Stop recommendations early if the best metric does not improve within this many iterations.
random_state – random state for recommenders
n_jobs – int (default: 1) The number of cores to dedicate to computing the scores with joblib. Assigning this parameter to -1 will dedicate as many cores as are available on your system.
- fit(X, y)[source]¶
Trains Aliro on X,y.
- Parameters
X (array-like {n_samples, n_features}) – Feature matrix of the training set
y (ndarray of shape (n_samples,)) – Target of the training set
- Returns
self
- Return type
object
- predict(X)[source]¶
Predictions for X.
- Parameters
X (array-like {n_samples, n_features}) – Feature matrix of the testing set
- Returns
y – The predicted target.
- Return type
ndarray of shape (n_samples,)
- score(X, y)[source]¶
Return the score on the given testing data using the user-specified scoring function.
- Parameters
X (array-like {n_samples, n_features}) – Feature matrix of the testing set
y (ndarray of shape (n_samples,)) – Target of the testing set
- Returns
accuracy_score – The estimated test set score under the user-specified scoring function
- Return type
float
The two classes below can be imported from pennai.sklearn after installing the pennaipy package via pip (see the User Guide of PennAIpy).
- class ai.sklearn.pennai_sklearn.PennAIClassifier(rec_class=None, verbose=0, serialized_rec=None, scoring=None, n_recs=10, n_iters=10, knowledgebase=None, kb_metafeatures=None, config_dict=None, ensemble=None, max_time_mins=None, stopping_criteria=None, random_state=None, n_jobs=1)[source]¶
Aliro engine for classification tasks.
Read more in the User Guide of PennAIpy.
- Parameters
rec_class (ai.recommender.base.BaseRecommender or None) – Recommender to use in the Aliro engine. If None, Aliro uses SVDRecommender by default.
verbose (int) – 0 quiet, 1 info, 2 debug
serialized_rec (string or None) – Path of the file to save/load a serialized recommender. If the filename is not provided, a default filename is generated from the recommender type, metric, and knowledgebase used.
scoring (str) – scoring for evaluating recommendations. It can be "accuracy", "balanced_accuracy", "f1", or "f1_macro"
n_recs (int) – number of recommendations to make for each iteration
n_iters (int) – total number of iterations
knowledgebase (str) – input file for knowledgebase
kb_metafeatures (str) – input file for metafeature
config_dict (python dictionary) – dictionary for hyperparameter search space for all ML algorithms
ensemble (int) – if it is an integer N, Aliro will use VotingClassifier/VotingRegressor to ensemble the top N best models into one model.
max_time_mins – maximum time in minutes that Aliro can run
stopping_criteria (int) – Number of iterations without improvement in the best metric. Stop recommendations early if the best metric does not improve within this many iterations.
random_state (int) – random state for recommenders
n_jobs (int) – The number of cores to dedicate to computing the scores with joblib. Assigning this parameter to -1 will dedicate as many cores as are available on your system.
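A usage sketch following scikit-learn conventions; the dataset and settings are illustrative:

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from pennai.sklearn import PennAIClassifier

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

    est = PennAIClassifier(n_recs=5, n_iters=10,
                           scoring='balanced_accuracy',
                           random_state=42, n_jobs=1)
    est.fit(X_train, y_train)
    print(est.score(X_test, y_test))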
- class ai.sklearn.pennai_sklearn.PennAIRegressor(rec_class=None, verbose=0, serialized_rec=None, scoring=None, n_recs=10, n_iters=10, knowledgebase=None, kb_metafeatures=None, config_dict=None, ensemble=None, max_time_mins=None, stopping_criteria=None, random_state=None, n_jobs=1)[source]¶
Aliro engine for regression tasks.
Read more in the User Guide of PennAIpy.
- Parameters
rec_class (ai.recommender.base.BaseRecommender or None) – Recommender to use in the Aliro engine. If None, Aliro uses SVDRecommender by default.
verbose (int) – 0 quiet, 1 info, 2 debug
serialized_rec (string or None) – Path of the file to save/load a serialized recommender. If the filename is not provided, a default filename is generated from the recommender type, metric, and knowledgebase used.
scoring (str) – scoring for evaluating recommendations. It can be "r2", "explained_variance", or "neg_mean_squared_error"
n_recs (int) – number of recommendations to make for each iteration
n_iters (int) – total number of iterations
knowledgebase (str) – input file for knowledgebase
kb_metafeatures (str) – input file for metafeature
config_dict (python dictionary) – dictionary for hyperparameter search space for all ML algorithms
ensemble (int) – if it is an integer N, Aliro will use VotingClassifier/VotingRegressor to ensemble the top N best models into one model.
max_time_mins – maximum time in minutes that Aliro can run
stopping_criteria (int) – Number of iterations without improvement in the best metric. Stop recommendations early if the best metric does not improve within this many iterations.
random_state (int) – random state for recommenders
n_jobs (int) – The number of cores to dedicate to computing the scores with joblib. Assigning this parameter to -1 will dedicate as many cores as are available on your system.
Learn¶
This is the API for building ML models. Aliro uses scikit-learn to achieve this.
IO¶
These methods control data flow between the server and sklearn models.
- class machine.learn.io_utils.Experiment(args, basedir='.')[source]¶
- get_input()[source]¶
Get input data based on experiment ID (_id) from Aliro API.
- Returns
input_data – a single pandas.DataFrame (Aliro will use train_test_split to make train/test splits) or a list of two pandas.DataFrame (the first is the training dataset, the second the testing dataset)
- Return type
pandas.Dataframe or list of two pandas.Dataframe
- machine.learn.io_utils.get_projects()[source]¶
Get information about all machine learning algorithms from the Aliro API. This information should match projects.json.
- Returns
projects – A dict with information about all machine learning algorithms
- Return type
dict
- machine.learn.io_utils.parse_args()[source]¶
Parse arguments for machine learning algorithm.
- Returns
args (dict) – Arguments of an experiment from the Aliro API
param_grid (dict) – Dictionary with parameters names (string) as keys and lists of parameter settings to try as values, or a list of such dictionaries, in which case the grids spanned by each dictionary in the list are explored. This enables searching over any sequence of parameter settings.
- machine.learn.io_utils.get_input_data(_id, tmpdir)[source]¶
Get input dataset information from Aliro API.
- Parameters
_id (string) – Experiment ID in Aliro
tmpdir (string) – Path of temporary directory
- Returns
input_data (pandas.Dataframe or list of two pandas.Dataframe) – a single pandas.DataFrame (Aliro will use train_test_split to make train/test splits) or a list of two pandas.DataFrame (the first is the training dataset, the second the testing dataset)
data_info (dict) –
target_name: string, target column name
filename: list, filename(s)
categories: list, categorical feature name(s)
ordinals: dict
keys: categorical feature name(s)
values: categorical values
- machine.learn.io_utils.get_file_data(file_id)[source]¶
Attempt to retrieve the dataset file. If the file is corrupt or an error response is returned, raise a ValueError.
- Parameters
file_id (string) – File ID from the Aliro database
- Returns
Dataset string which will be read by pandas and converted to a pd.DataFrame
- Return type
string
- machine.learn.io_utils.check_column(column_name, dataframe)[source]¶
Check if a column exists in a pandas DataFrame.
- Parameters
column_name (string) – column name
dataframe (pandas.DataFrame) – input dataset DataFrame
- Return type
None
- machine.learn.io_utils.bool_type(val)[source]¶
Convert argument to boolean type.
- Parameters
val (string) – Value of a parameter in string type
- Returns
_ – Converted value in boolean type
- Return type
boolean
- machine.learn.io_utils.none(val)[source]¶
Convert a "none" argument to None.
- Parameters
val (string) – Value of a parameter in string type
- Returns
_ – If the input value is "none", the function returns None; otherwise it returns the string unchanged.
- Return type
None or string
- machine.learn.io_utils.get_type(param_type)[source]¶
Return the conversion function for an input type.
- Parameters
param_type (string or list) – string: type of a parameter as defined in projects.json; list: list of parameter types (for parameters supporting multiple input types)
- Returns
known_types[type] – Function for converting an argument from the Aliro UI before assigning it to a scikit-learn estimator
- Return type
function
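A sketch of how these converters behave on UI strings; the exact set of accepted strings and type keys is an assumption based on the descriptions above:

    from machine.learn.io_utils import bool_type, none, get_type

    bool_type('True')   # -> True: UI string converted to a Python bool
    none('none')        # -> None
    none('auto')        # -> 'auto': non-'none' strings pass through unchanged

    to_int = get_type('int')   # assumed type key; returns a conversion function
    to_int('10')               # -> 10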
Scikit-learn Utils¶
These methods generate sklearn models and evaluate them.
- machine.learn.skl_utils.balanced_accuracy(y_true, y_pred)[source]¶
Default scoring function of classification: balanced accuracy. Balanced accuracy computes each class’ accuracy on a per-class basis using a one-vs-rest encoding, then computes an unweighted average of the class accuracies.
- Parameters
y_true (numpy.ndarray {n_samples}) – True class labels
y_pred (numpy.ndarray {n_samples}) – Predicted class labels by the estimator
- Returns
fitness – A float value indicating balanced accuracy: 0.5 is as good as chance, and 1.0 is perfect predictive accuracy
- Return type
float
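A worked sketch of the definition above: for each class, treat the problem as one-vs-rest, average that class's sensitivity and specificity, then take the unweighted mean over classes. This illustrates the metric and is not necessarily the shipped implementation:

    import numpy as np

    def balanced_accuracy_sketch(y_true, y_pred):
        y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
        per_class = []
        for c in np.unique(y_true):
            sensitivity = np.mean(y_pred[y_true == c] == c)  # true positive rate
            specificity = np.mean(y_pred[y_true != c] != c)  # true negative rate
            per_class.append((sensitivity + specificity) / 2.0)
        return np.mean(per_class)

    balanced_accuracy_sketch([0, 0, 1, 1], [0, 1, 1, 1])  # -> 0.75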
- machine.learn.skl_utils.generate_results(model, input_data, tmpdir, _id, target_name='class', mode='classification', figure_export=True, random_state=None, filename=['test_dataset'], categories=None, ordinals=None, encoding_strategy='OneHotEncoder', param_grid={})[source]¶
Generate results for applying a model to a dataset in Aliro.
- Parameters
model (scikit-learn Estimator) – A machine learning model following scikit-learn API
input_data (pandas.Dataframe or list of two pandas.Dataframe) – a single pandas.DataFrame (Aliro will use 10-fold CV to estimate train/test scores) or a list of two pandas.DataFrame (the first is the training dataset, the second the testing dataset)
target_name (string) – Target name in input data
tmpdir (string) – Temporary directory for saving experiment results
_id (string) – Experiment id
mode (string) – ‘classification’: Run classification analysis ‘regression’: Run regression analysis
figure_export (boolean) – If figure_export is True, the figures will be generated and exported.
random_state (int) – Random seed
filename (list) – Filename for input dataset
categories (list) – List of column names for one-hot encoding
ordinals (dict) – Dictionary of ordinal features:
keys: categorical feature name(s)
values: categorical values
encoding_strategy (string) – Encoding strategy for categorical features defined in projects.json
param_grid (dict) – If param_grid is a non-empty dictionary, the experiment will do parameter tuning via GridSearchCV, report the best result to the UI, and save all results to the knowledge base.
- Return type
None
- machine.learn.skl_utils.get_col_idx(feature_names_list, columns)[source]¶
Get unique indexes of columns based on a list of column names.
- Parameters
feature_names_list (list) – List of column names in the dataset
columns (list) – List of selected column names
- Returns
col_idx – list of selected column indexes
- Return type
list
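For example (toy inputs; the assumption here is that indexes come back in dataset-column order):

    from machine.learn.skl_utils import get_col_idx

    get_col_idx(['sepal_len', 'sepal_wid', 'petal_len'],
                ['petal_len', 'sepal_wid'])
    # -> [1, 2], the unique indexes of the selected columns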
- machine.learn.skl_utils.setup_model_params(model, parameter_name, value)[source]¶
Assign a value to a parameter in a model.
- Parameters
model (scikit-learn Estimator) – Machine learning model following the scikit-learn API
parameter_name (string) – Parameter name in the scikit-learn model
value (object) – Value to assign to the parameter
- Returns
model – A new scikit-learn model with an updated parameter
- Return type
scikit-learn Estimator
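For example, pinning a seed on an estimator (a toy call, similar in spirit to scikit-learn's own set_params):

    from sklearn.tree import DecisionTreeClassifier
    from machine.learn.skl_utils import setup_model_params

    model = DecisionTreeClassifier()
    model = setup_model_params(model, 'random_state', 42)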
- machine.learn.skl_utils.compute_imp_score(model, metric, features, target, random_state)[source]¶
Compute permutation importance scores for features.
- Parameters
model (scikit-learn Estimator) – A fitted scikit-learn model
metric (str or callable) – The metric for evaluating feature importance through permutation. By default, the string 'accuracy' is recommended for classifiers and the string 'r2' for regressors. Optionally, a custom scoring function (e.g., metric=scoring_func) that accepts two arguments, y_true and y_pred, which have shapes similar to the y array.
features (np.darray/pd.DataFrame) – Features in training dataset
target (np.darray/pd.DataFrame) – Target in training dataset
random_state (int) – Random seed for permutation importances
- Returns
coefs (np.darray) – Feature importance scores
imp_score_type (string) – Importance score type
- machine.learn.skl_utils.save_json_fmt(outdir, _id, fname, content)[source]¶
Save results into json format.
- Parameters
outdir (string) – Path of output directory
_id (string) – Experiment ID in Aliro
fname (string) – File name
content (list or dict) – Content of the results
- Return type
None
- machine.learn.skl_utils.plot_confusion_matrix(tmpdir, _id, X, y, class_names, cv_scores, figure_export)[source]¶
Make plot for confusion matrix.
- Parameters
tmpdir (string) – Temporary directory for saving experiment results
_id (string) – Experiment ID in Aliro
X (np.darray/pd.DataFrame) – Features in training dataset
y (np.darray/pd.DataFrame) – Target in training dataset
class_names (list) – List of class names
cv_scores (dictionary) – Return from sklearn.model_selection.cross_validate
figure_export (boolean) – If true, then export the confusion matrix plot
- Return type
None
- machine.learn.skl_utils.plot_learning_curve(tmpdir, _id, model, features, target, cv, return_times=True)[source]¶
Make learning curve.
- Parameters
tmpdir (string) – Temporary directory for saving experiment results
_id (string) – Experiment ID in Aliro
model (user specified model) –
features (np.darray/pd.DataFrame) – Features in training dataset
target (np.darray/pd.DataFrame) – Target in training dataset
cv (int, cross-validation generator or an iterable) –
- Return type
None
- machine.learn.skl_utils.plot_pca_2d(tmpdir, _id, features, target)[source]¶
Make a 2D PCA plot.
- Parameters
tmpdir (string) – Temporary directory for saving 2d pca plot and json file
_id (string) – Experiment ID in Aliro
features (np.darray/pd.DataFrame) – Features in training dataset
target (np.darray/pd.DataFrame) – Target in training dataset
- Return type
None
- machine.learn.skl_utils.plot_tsne_2d(tmpdir, _id, features, target)[source]¶
Make a 2D t-SNE plot.
- Parameters
tmpdir (string) – Temporary directory for saving 2d t-sne plot and json file
_id (string) – Experiment ID in Aliro
features (np.darray/pd.DataFrame) – Features in training dataset
target (np.darray/pd.DataFrame) – Target in training dataset
- Return type
None
- machine.learn.skl_utils.plot_roc_curve(tmpdir, _id, X, y, cv_scores, figure_export)[source]¶
Plot ROC curve.
- Parameters
tmpdir (string) – Temporary directory for saving experiment results
_id (string) – Experiment ID in Aliro
X (np.darray/pd.DataFrame) – Features in training dataset
y (np.darray/pd.DataFrame) – Target in training dataset
cv_scores (dictionary) – Return from sklearn.model_selection.cross_validate
figure_export (boolean) – If true, then export the ROC curve plot
- Return type
None
- machine.learn.skl_utils.plot_imp_score(tmpdir, _id, coefs, feature_names, imp_score_type)[source]¶
Plot importance scores for features.
- Parameters
tmpdir (string) – Temporary directory for saving experiment results
_id (string) – Experiment ID in Aliro
coefs (array) – Feature importance scores
feature_names (np.array) – List of feature names
imp_score_type (string) – Importance score type
- Returns
top_features (list) – Top features with high importance score
indices (ndarray) – Array of indices of top important features
- machine.learn.skl_utils.plot_dot_plot(tmpdir, _id, features, target, top_features_name, indices, random_state, mode)[source]¶
Make a dot plot based on a decision tree.
- Parameters
tmpdir (string) – Temporary directory for saving experiment results
_id (string) – Experiment ID in Aliro
features (np.darray/pd.DataFrame) – Features in training dataset
target (np.darray/pd.DataFrame) – Target in training dataset
top_features_name (list) – Names of top features
indices (ndarray) – Array of indices of top important features
random_state (int) – Random seed for permutation importances
mode (string) – ‘classification’: Run classification analysis ‘regression’: Run regression analysis
- Returns
dtree_train_score – Test score from fitting the decision tree on the top important features
- Return type
float
- machine.learn.skl_utils.export_model(tmpdir, _id, model, filename, target_name, mode='classification', random_state=42)[source]¶
Export the model as a pickle file and generate a script for using the pickled model.
- Parameters
tmpdir (string) – Temporary directory for saving experiment results
_id (string) – Experiment ID in Aliro
model (scikit-learn estimator) – A fitted scikit-learn model
filename (string) – File name of input dataset
target_name (string) – Target name in input data
mode (string) – ‘classification’: Run classification analysis ‘regression’: Run regression analysis
random_state (int) – Random seed in model
- Return type
None
- machine.learn.skl_utils.generate_export_codes(pickle_file_name, model, filename, target_name, mode='classification', random_state=42)[source]¶
Generate all library import calls for use in stand-alone Python scripts.
- Parameters
pickle_file_name (string) – Pickle file name for a fitted scikit-learn estimator
model (scikit-learn estimator) – A fitted scikit-learn model
filename (string) – File name of input dataset
target_name (string) – Target name in input data
mode (string) – ‘classification’: Run classification analysis ‘regression’: Run regression analysis
random_state (int) – Random seed in model
- Returns
pipeline_text – The Python script for applying the current optimized pipeline in a stand-alone Python environment
- Return type
String
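The exported pickle is meant to be reusable outside Aliro. A hedged sketch of what a stand-alone script applying it might look like; the file names, the choice of pickle over joblib, and the target column are assumptions, and the generated script itself is authoritative:

    import pickle
    import pandas as pd

    with open('model_12345.pkl', 'rb') as f:  # hypothetical pickle file name
        model = pickle.load(f)

    df = pd.read_csv('test_dataset.tsv', sep='\t')  # hypothetical dataset file
    X = df.drop(columns=['class'])                  # 'class' = target_name
    print(model.predict(X))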