AutoML class
- class hana_automl.automl.AutoML(connection_context: Optional[hana_ml.dataframe.ConnectionContext] = None)
Main class. Control the whole Automated Machine Learning process here.
What is AutoML? Read here: https://www.automl.org/automl/
- connection_context
Connection info to HANA database.
- Type
hana_ml.dataframe.ConnectionContext
- opt
Optimizer from pipeline.
- model
Tuned and fitted HANA PAL model.
- predicted
Dataframe containing predicted values.
- preprocessor_settings
Preprocessor settings.
- Type
- property best_params
Get best hyperparameters.
- fit(df: Optional[Union[pandas.core.frame.DataFrame, hana_ml.dataframe.DataFrame, str]] = None, file_path: Optional[str] = None, table_name: Optional[str] = None, task: Optional[str] = None, steps: Optional[int] = None, target: Optional[str] = None, columns_to_remove: Optional[list] = None, categorical_features: Optional[list] = None, id_column: Optional[str] = None, optimizer: str = 'OptunaSearch', time_limit: Optional[int] = None, ensemble: bool = False, verbose=2, output_leaderboard: bool = False, strategy_by_col: Optional[list] = None, tuning_metric: Optional[str] = None)
Fits AutoML object.
- Parameters
df (pandas.DataFrame or hana_ml.dataframe.DataFrame or str) – Attention: you must pass the whole dataframe, without dividing it into X_train, y_train, etc. See Notes for extra information.
file_path (str) – Path/url to dataframe. Accepts .csv and .xlsx. See Notes for extra information. Examples: ‘https://website/dataframe.csv’ or ‘users/dev/file.csv’.
table_name (str) – Name of table in HANA database. See Notes for extra information
task (str) – Machine Learning task. ‘reg’(regression) and ‘cls’(classification) are currently supported.
steps (int) – Number of iterations.
target (str) – The column we want to predict. For multiple columns pass a list. Example: [‘feature1’, ‘feature2’].
columns_to_remove (list) – List of columns to delete. Example: [‘column1’, ‘column2’].
categorical_features (list) – Categorical features are columns that generally take a limited number of possible values. For example, if our target variable contains only 1 and 0, it is considered categorical. Another example: column ‘gender’ contains ‘male’, ‘female’ and ‘other’ values. As it (generally) can’t contain any other values, it is categorical. Details here: https://towardsdatascience.com/understanding-feature-engineering-part-2-categorical-data-f54324193e63 Example: [‘column1’, ‘column2’].
id_column (str) – ID column in table. Needed for HANA. If None, it will be created in the dataset automatically. Example: ‘ID’.
optimizer (str) – Optimizer to tune hyperparameters. Currently supported: “OptunaSearch” (default), “BayesianOptimizer” (unstable)
time_limit (int) – Amount of time (in seconds) to tune the model.
ensemble (bool) – Specify if you want to get an ensemble. Currently supported: “blending”. What is that? Details here: Ensembles
verbose (int) – Level of output. 1 - minimal, 2 - all output.
output_leaderboard (bool) – Print algorithms leaderboard or not.
strategy_by_col (ListOfTuples) – Specifies the imputation strategy for a set of columns, which overrides the overall strategy for data imputation. Each tuple in the list should contain at least two elements: the 1st element is the name of a column; the 2nd element is the imputation strategy for that column (for numerical columns: ‘mean’, ‘median’, ‘delete’, ‘als’ or ‘numerical_const’; for categorical columns: ‘categorical_const’). If the imputation strategy is ‘categorical_const’ or ‘numerical_const’, a 3rd element must be included in the tuple, specifying the constant value used to replace the missing values detected in the column.
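The tuple layout expected by strategy_by_col can be sketched as follows. This is only an illustration of the rules stated above, not part of hana_automl: the helper check_strategy_by_col and the column names ‘age’ and ‘balance’ are hypothetical, and the set of strategy names is simply the one listed in the parameter description.

```python
from typing import List, Tuple

# Strategy names exactly as listed in the parameter description above.
KNOWN_STRATEGIES = {"mean", "median", "delete", "als",
                    "numerical_const", "categorical_const"}
# Strategies that require a constant as the 3rd tuple element.
CONST_STRATEGIES = {"numerical_const", "categorical_const"}


def check_strategy_by_col(strategy_by_col: List[Tuple]) -> None:
    """Hypothetical helper: raise ValueError if an entry is malformed."""
    for entry in strategy_by_col:
        if len(entry) < 2:
            raise ValueError(f"{entry!r}: expected (column, strategy[, constant])")
        column, strategy = entry[0], entry[1]
        if strategy not in KNOWN_STRATEGIES:
            raise ValueError(f"{column}: unknown strategy {strategy!r}")
        if strategy in CONST_STRATEGIES and len(entry) < 3:
            raise ValueError(f"{column}: {strategy!r} needs a constant as 3rd element")


strategy_by_col = [
    ("age", "median"),                              # numerical column, median imputation
    ("balance", "numerical_const", 0),              # replace missing values with 0
    ("education", "categorical_const", "unknown"),  # replace missing values with a label
]
check_strategy_by_col(strategy_by_col)  # passes without raising
```

A list checked this way could then be passed as strategy_by_col=strategy_by_col to fit.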
Notes
There are multiple options to load data in HANA database. Here are all available parameter combinations:
1) df: pandas.DataFrame -> dataframe will be loaded to a new table with random name, like ‘AUTOML-9082-842408-12’
2) df: pandas.DataFrame + table_name -> dataframe will be loaded to existing table
3) df: hana_ml.dataframe.DataFrame -> existing Dataframe will be used
4) table_name -> we’ll connect to existing table
5) file_path -> data from file/url will be loaded to a new table with random name, like ‘AUTOML-9082-842408-12’
6) file_path + table_name -> data from file/url will be loaded to existing table
7) df: str -> we’ll connect to existing table
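The seven combinations above can be read as a small decision table. The sketch below restates that table in plain Python; describe_data_source is a hypothetical helper written for this page, not a hana_automl function, and the precedence it uses when several arguments are given at once (df first, then file_path, then table_name) is an assumption. Any non-string df that does not come from the hana_ml package is treated as a pandas dataframe.

```python
from typing import Any, Optional


def describe_data_source(df: Any = None,
                         file_path: Optional[str] = None,
                         table_name: Optional[str] = None) -> str:
    """Hypothetical helper mirroring the combinations listed in the Notes."""
    if isinstance(df, str):
        return "connect to existing table"                     # option 7
    if df is not None:
        if type(df).__module__.startswith("hana_ml"):
            return "use existing HANA dataframe"                # option 3
        if table_name is not None:
            return "load dataframe into existing table"         # option 2
        return "load dataframe into new random-named table"     # option 1
    if file_path is not None:
        if table_name is not None:
            return "load file into existing table"              # option 6
        return "load file into new random-named table"          # option 5
    if table_name is not None:
        return "connect to existing table"                      # option 4
    raise ValueError("pass at least one of df, file_path or table_name")
```

For instance, describe_data_source(file_path='users/dev/file.csv', table_name='TRAIN') corresponds to option 6; the table name ‘TRAIN’ is made up for the illustration.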
Examples
Passing connection info:
>>> from hana_ml.dataframe import ConnectionContext
>>> cc = ConnectionContext(address='database address',
...                        user='your username',
...                        password='your password',
...                        port=9999) #your port
Creating and fitting the model (see fitting section for detailed example):
>>> from hana_automl.automl import AutoML
>>> automl = AutoML(cc)
>>> automl.fit(
...     df = df, # your pandas dataframe.
...     target="y",
...     id_column='ID',
...     categorical_features=["y", 'marital', 'education', 'housing', 'loan'],
...     columns_to_remove=['default', 'contact', 'month', 'poutcome', 'job'],
...     steps=10,
... )
- get_algorithm()
Returns fitted AutoML algorithm. If ‘ensemble’ parameter is True, returns ensemble algorithm.
- get_leaderboard()
Returns the algorithms leaderboard.
- get_model()
Returns fitted HANA PAL model.
- model_to_file(file_path: str)
Saves model information to JSON file.
- Parameters
file_path (str) – Path to save
- property optimizer
Get optimizer.
- predict(df: Optional[Union[pandas.core.frame.DataFrame, hana_ml.dataframe.DataFrame, str]] = None, file_path: Optional[str] = None, table_name: Optional[str] = None, id_column: Optional[str] = None, target_drop: Optional[str] = None, verbose=1) → pandas.core.frame.DataFrame
Makes predictions using fitted model.
- Parameters
df (pandas.DataFrame or hana_ml.dataframe.DataFrame or str) – Attention: you must pass the whole dataframe, without dividing it into X_train, y_train, etc. See Notes for extra information.
file_path (str) – Path/url to dataframe. Accepts .csv and .xlsx. See Notes for extra information. Examples: ‘https://website/dataframe.csv’ or ‘users/dev/file.csv’.
table_name (str) – Name of table in HANA database. See Notes for extra information
id_column (str) – ID column in table. Needed for HANA. If None, it will be created in the dataset automatically.
target_drop (str) – Target column to drop, if it exists in the input data.
verbose (int) – Level of output. 0 - minimal, 1 - all output.
Notes
There are multiple options to load data in HANA database. Here are all available parameter combinations:
1) df: pandas.DataFrame -> dataframe will be loaded to a new table with random name, like ‘AUTOML-9082-842408-12’
2) df: pandas.DataFrame + table_name -> dataframe will be loaded to existing table
3) df: hana_ml.dataframe.DataFrame -> existing Dataframe will be used
4) table_name -> we’ll connect to existing table
5) file_path -> data from file/url will be loaded to a new table with random name, like ‘AUTOML-9082-842408-12’
6) file_path + table_name -> data from file/url will be loaded to existing table
7) df: str -> we’ll connect to existing table
- Returns
Pandas dataframe with predictions.
- Return type
pandas.core.frame.DataFrame
Examples
>>> automl.predict(file_path='data/predict.csv',
...                table_name='PREDICTION',
...                id_column='ID',
...                target_drop='target',
...                verbose=1)
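A prediction run is often followed by exporting the results. The sketch below chains predict with save_results_as_csv, both documented on this page. predict_and_export is a hypothetical wrapper, not part of hana_automl; it assumes automl is an already fitted AutoML object and that save_results_as_csv writes the results of the most recent predict call, which this page does not state explicitly. The default values for id_column and target_drop are taken from the example above.

```python
from typing import Any


def predict_and_export(automl: Any, file_path: str, results_csv: str,
                       id_column: str = "ID", target_drop: str = "target") -> Any:
    """Hypothetical wrapper: predict from a file, then save the results as .csv.

    `automl` is assumed to be an already fitted AutoML object.
    """
    predictions = automl.predict(
        file_path=file_path,
        id_column=id_column,
        target_drop=target_drop,  # dropped if the file still contains the target
        verbose=0,                # minimal output
    )
    automl.save_results_as_csv(results_csv)
    return predictions
```

Called as predict_and_export(automl, 'data/predict.csv', 'predictions.csv'), it returns the dataframe produced by predict; the output file name is made up for the illustration.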
- print_leaderboard()
Output leaderboard.
- save_results_as_csv(file_path: str)
Saves prediction results to .csv file.
- Parameters
file_path (str) – Path to save
- save_stats_as_csv(file_path: str)
Saves prediction statistics to .csv file.
- Parameters
file_path (str) – Path to save
- score(df: Optional[Union[pandas.core.frame.DataFrame, hana_ml.dataframe.DataFrame, str]] = None, file_path: Optional[str] = None, table_name: Optional[str] = None, target: Optional[str] = None, id_column: Optional[str] = None, metric=None) → float
Returns model score.
- Parameters
df (pandas.DataFrame or hana_ml.dataframe.DataFrame or str) – Attention: you must pass the whole dataframe, without dividing it into X_train, y_train, etc. See Notes for extra information.
file_path (str) – Path/url to dataframe. Accepts .csv and .xlsx. See Notes for extra information. Examples: ‘https://website/dataframe.csv’ or ‘users/dev/file.csv’.
table_name (str) – Name of table in HANA database. See Notes for extra information
target (str) – Variable to predict. It will be dropped.
id_column (str) – ID column. If None, it’ll be generated automatically.
Notes
There are multiple options to load data in HANA database. Here are all available parameter combinations:
1) df: pandas.DataFrame -> dataframe will be loaded to a new table with random name, like ‘AUTOML-9082-842408-12’
2) df: pandas.DataFrame + table_name -> dataframe will be loaded to existing table
3) df: hana_ml.dataframe.DataFrame -> existing Dataframe will be used
4) table_name -> we’ll connect to existing table
5) file_path -> data from file/url will be loaded to a new table with random name, like ‘AUTOML-9082-842408-12’
6) file_path + table_name -> data from file/url will be loaded to existing table
7) df: str -> we’ll connect to existing table
- Returns
score – Model score.
- Return type
float
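Because score returns a plain float, scores for several existing HANA tables can be collected in a dictionary. The sketch below does this with the table_name, target and id_column parameters documented above. score_tables is a hypothetical helper, not part of hana_automl, and it assumes automl has already been fitted and that every table contains the target column.

```python
from typing import Any, Dict, Iterable


def score_tables(automl: Any, table_names: Iterable[str], target: str,
                 id_column: str = "ID") -> Dict[str, float]:
    """Hypothetical helper: score a fitted AutoML object on several tables."""
    scores: Dict[str, float] = {}
    for name in table_names:
        # Each call connects to an existing table (option 4 in the Notes).
        scores[name] = automl.score(table_name=name, target=target,
                                    id_column=id_column)
    return scores
```

Since the metric argument is left at its default here, the meaning of the returned numbers depends on whatever metric the library picks for the task.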
- sort_leaderboard(metric, df=None, id_col=None, target=None, verbose=1)
Sorts leaderboard by given metric.