AutoML class¶

class hana_automl.automl.AutoML(connection_context: Optional[hana_ml.dataframe.ConnectionContext] = None)¶

Main class. Control the whole Automated Machine Learning process here. What is AutoML? Read here: https://www.automl.org/automl/

connection_context¶

Connection info to HANA database.

Type: hana_ml.dataframe.ConnectionContext

opt¶: Optimizer from pipeline.

model¶: Tuned and fitted HANA PAL model.

predicted¶: Dataframe containing predicted values.

preprocessor_settings¶

Preprocessor settings.

Type: PreprocessorSettings

property best_params¶: Get best hyperparameters

fit(df: Optional[Union[pandas.core.frame.DataFrame, hana_ml.dataframe.DataFrame, str]] = None, file_path: Optional[str] = None, table_name: Optional[str] = None, task: Optional[str] = None, steps: Optional[int] = None, target: Optional[str] = None, columns_to_remove: Optional[list] = None, categorical_features: Optional[list] = None, id_column: Optional[str] = None, optimizer: str = 'OptunaSearch', time_limit: Optional[int] = None, ensemble: bool = False, verbose=2, output_leaderboard: bool = False, strategy_by_col: Optional[list] = None, tuning_metric: Optional[str] = None)¶

Fits AutoML object

Parameters

df (pandas.DataFrame or hana_ml.dataframe.DataFrame or str) – Attention: you must pass the whole dataframe, without dividing it into X_train, y_train, etc. See Notes for extra information.
file_path (str) – Path/url to dataframe. Accepts .csv and .xlsx. See Notes for extra information. Examples: ‘https://website/dataframe.csv’ or ‘users/dev/file.csv’
table_name (str) – Name of table in HANA database. See Notes for extra information
task (str) – Machine Learning task. ‘reg’(regression) and ‘cls’(classification) are currently supported.
steps (int) – Number of iterations.
target (str) – The column we want to predict. For multiple columns pass a list. Example: [‘feature1’, ‘feature2’].
columns_to_remove (list) – List of columns to delete. Example: [‘column1’, ‘column2’].
categorical_features (list) – Categorical features are columns that generally take a limited number of possible values. For example, if our target variable contains only 1 and 0, it is considered categorical. Another example: column ‘gender’ contains ‘male’, ‘female’ and ‘other’ values. As it (generally) can’t contain any other values, it is categorical. Details here: https://towardsdatascience.com/understanding-feature-engineering-part-2-categorical-data-f54324193e63 Example: [‘column1’, ‘column2’].
id_column (str) – ID column in table. Needed for HANA. Example: ‘ID’. If None, it will be created in the dataset automatically.
optimizer (str) – Optimizer to tune hyperparameters. Currently supported: “OptunaSearch” (default), “BayesianOptimizer” (unstable)
time_limit (int) – Amount of time (in seconds) to tune the model.
ensemble (bool) – Specify if you want to get an ensemble. Currently supported: “blending”. What is that? Details here: Ensembles
verbose (int) – Level of output. 1 - minimal, 2 - all output.
output_leaderboard (bool) – Print algorithms leaderboard or not.
strategy_by_col (ListOfTuples) – Specifies the imputation strategy for a set of columns, overriding the overall data imputation strategy. Each tuple in the list should contain at least two elements: the 1st element is the name of a column; the 2nd element is the imputation strategy for that column (for numerical columns: “mean”, “median”, “delete”, “als”, “numerical_const”; for categorical columns: “categorical_const”). If the imputation strategy is “categorical_const” or “numerical_const”, a 3rd element must be included in the tuple, specifying the constant value used to replace the missing values detected in the column.
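The tuple layout expected by strategy_by_col can be checked before calling fit. The sketch below is only an illustration: check_strategy_by_col is a hypothetical helper, not part of hana_automl, the column names are made up, and the strategy names are copied from the parameter description above.

```python
from typing import Any, List, Tuple

# Strategy names copied from the parameter description above.
NUMERICAL_STRATEGIES = {"mean", "median", "delete", "als", "numerical_const"}
CONST_STRATEGIES = {"numerical_const", "categorical_const"}
ALL_STRATEGIES = NUMERICAL_STRATEGIES | {"categorical_const"}


def check_strategy_by_col(strategy_by_col: List[Tuple[Any, ...]]) -> List[Tuple[Any, ...]]:
    """Hypothetical helper (not part of hana_automl): validate tuple shapes."""
    for entry in strategy_by_col:
        if len(entry) < 2:
            raise ValueError(f"{entry!r}: expected (column, strategy[, constant])")
        column, strategy = entry[0], entry[1]
        if strategy not in ALL_STRATEGIES:
            raise ValueError(f"{column!r}: unknown strategy {strategy!r}")
        # The *_const strategies need the replacement value as a 3rd element.
        if strategy in CONST_STRATEGIES and len(entry) < 3:
            raise ValueError(f"{column!r}: {strategy!r} needs a constant as 3rd element")
    return strategy_by_col


# Column names below are illustrative only.
strategy_by_col = check_strategy_by_col([
    ("age", "median"),
    ("balance", "numerical_const", 0),
    ("marital", "categorical_const", "unknown"),
])
```

The validated list could then be passed as fit(..., strategy_by_col=strategy_by_col).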

Notes

There are multiple options to load data in HANA database. Here are all available parameter combinations:

1) df: pandas.DataFrame -> dataframe will be loaded to a new table with random name, like ‘AUTOML-9082-842408-12’

2) df: pandas.DataFrame + table_name -> dataframe will be loaded to existing table

3) df: hana_ml.dataframe.DataFrame -> existing Dataframe will be used

4) table_name -> we’ll connect to existing table

5) file_path -> data from file/url will be loaded to a new table with random name, like ‘AUTOML-9082-842408-12’

6) file_path + table_name -> data from file/url will be loaded to existing table

7) df: str -> we’ll connect to existing table
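The seven combinations above can be read as a small decision table. The following sketch is not the library's implementation: describe_data_source and its df_kind argument are hypothetical stand-ins, and checking df before file_path is an assumption, since the Notes do not say what happens when both are passed.

```python
from typing import Optional


def describe_data_source(df_kind: Optional[str] = None,
                         file_path: Optional[str] = None,
                         table_name: Optional[str] = None) -> str:
    """Hypothetical helper mirroring the seven combinations in the Notes.

    df_kind is an illustrative stand-in for the type of `df`:
    "pandas", "hana" or "str" (None when `df` is not passed).
    """
    if df_kind == "pandas":  # options 1 and 2
        return "load into existing table" if table_name else "load into new table"
    if df_kind == "hana":  # option 3
        return "use existing dataframe"
    if df_kind == "str":  # option 7
        return "connect to existing table"
    if file_path:  # options 5 and 6
        return "load into existing table" if table_name else "load into new table"
    if table_name:  # option 4
        return "connect to existing table"
    raise ValueError("pass df, file_path or table_name")
```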

Examples

Passing connection info:

>>> from hana_ml.dataframe import ConnectionContext
>>> cc = ConnectionContext(address='database address',
...                        user='your username',
...                        password='your password',
...                        port=9999) #your port

Creating and fitting the model (see fitting section for detailed example):

>>> from hana_automl.automl import AutoML
>>> automl = AutoML(cc)
>>> automl.fit(
...     df=df,  # your pandas dataframe.
...     target="y",
...     id_column='ID',
...     categorical_features=["y", 'marital', 'education', 'housing', 'loan'],
...     columns_to_remove=['default', 'contact', 'month', 'poutcome', 'job'],
...     steps=10,
... )

get_algorithm()¶: Returns fitted AutoML algorithm. If ‘ensemble’ parameter is True, returns ensemble algorithm.

get_leaderboard()¶: Get the algorithms leaderboard

get_model()¶: Returns fitted HANA PAL model

model_to_file(file_path: str)¶

Saves model information to JSON file

Parameters: file_path (str) – Path to save

property optimizer¶: Get optimizer

predict(df: Optional[Union[pandas.core.frame.DataFrame, hana_ml.dataframe.DataFrame, str]] = None, file_path: Optional[str] = None, table_name: Optional[str] = None, id_column: Optional[str] = None, target_drop: Optional[str] = None, verbose=1) → pandas.core.frame.DataFrame¶

Makes predictions using fitted model.

Parameters

df (pandas.DataFrame or hana_ml.dataframe.DataFrame or str) – Attention: you must pass the whole dataframe, without dividing it into X_train, y_train, etc. See Notes for extra information.
file_path (str) – Path/url to dataframe. Accepts .csv and .xlsx. See Notes for extra information. Examples: ‘https://website/dataframe.csv’ or ‘users/dev/file.csv’
table_name (str) – Name of table in HANA database. See Notes for extra information
id_column (str) – ID column in table. Needed for HANA. If None, it will be created in dataset automatically
target_drop (str) – Target to drop, if it exists in inputted data
verbose (int) – Level of output. 0 - minimal, 1 - all output.

Notes

There are multiple options to load data in HANA database. Here are all available parameter combinations:

1) df: pandas.DataFrame -> dataframe will be loaded to a new table with random name, like ‘AUTOML-9082-842408-12’

2) df: pandas.DataFrame + table_name -> dataframe will be loaded to existing table

3) df: hana_ml.dataframe.DataFrame -> existing Dataframe will be used

4) table_name -> we’ll connect to existing table

5) file_path -> data from file/url will be loaded to a new table with random name, like ‘AUTOML-9082-842408-12’

6) file_path + table_name -> data from file/url will be loaded to existing table

7) df: str -> we’ll connect to existing table

Returns: Pandas dataframe with predictions.
Return type: pandas.core.frame.DataFrame

Examples

>>> automl.predict(file_path='data/predict.csv',
...                table_name='PREDICTION',
...                id_column='ID',
...                target_drop='target',
...                verbose=1)

print_leaderboard()¶: Output leaderboard

save_results_as_csv(file_path: str)¶

Saves prediction results to .csv file

Parameters: file_path (str) – Path to save

save_stats_as_csv(file_path: str)¶

Saves prediction statistics to .csv file

Parameters: file_path (str) – Path to save

score(df: Optional[Union[pandas.core.frame.DataFrame, hana_ml.dataframe.DataFrame, str]] = None, file_path: Optional[str] = None, table_name: Optional[str] = None, target: Optional[str] = None, id_column: Optional[str] = None, metric=None) → float¶

Returns model score.

Parameters

df (pandas.DataFrame or hana_ml.dataframe.DataFrame or str) – Attention: you must pass the whole dataframe, without dividing it into X_train, y_train, etc. See Notes for extra information.
file_path (str) – Path/url to dataframe. Accepts .csv and .xlsx. See Notes for extra information. Examples: ‘https://website/dataframe.csv’ or ‘users/dev/file.csv’
table_name (str) – Name of table in HANA database. See Notes for extra information
target (str) – Variable to predict. It will be dropped.
id_column (str) – ID column. If None, it’ll be generated automatically.

Notes

There are multiple options to load data in HANA database. Here are all available parameter combinations:

1) df: pandas.DataFrame -> dataframe will be loaded to a new table with random name, like ‘AUTOML-9082-842408-12’

2) df: pandas.DataFrame + table_name -> dataframe will be loaded to existing table

3) df: hana_ml.dataframe.DataFrame -> existing Dataframe will be used

4) table_name -> we’ll connect to existing table

5) file_path -> data from file/url will be loaded to a new table with random name, like ‘AUTOML-9082-842408-12’

6) file_path + table_name -> data from file/url will be loaded to existing table

7) df: str -> we’ll connect to existing table

Returns: score – Model score.
Return type: float
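As a rough illustration of how fit, predict and score fit together, the sketch below wraps the three calls in one function. train_and_evaluate is a hypothetical wrapper, not part of hana_automl. It only uses keyword arguments that appear in the signatures above and assumes automl is an AutoML instance created with a working connection_context.

```python
from typing import Any, Dict


def train_and_evaluate(automl: Any, train_df: Any, test_df: Any,
                       target: str, id_column: str, steps: int = 10) -> Dict[str, Any]:
    """Hypothetical wrapper: fit, predict and score with one AutoML object."""
    # fit() expects the whole dataframe, target column included.
    automl.fit(df=train_df, target=target, id_column=id_column, steps=steps)
    # target_drop names the target column so it is dropped if present.
    predictions = automl.predict(df=test_df, id_column=id_column, target_drop=target)
    # score() also takes the whole dataframe and returns a float.
    score = automl.score(df=test_df, target=target, id_column=id_column)
    return {
        "predictions": predictions,
        "score": score,
        "best_params": automl.best_params,
    }
```

A call might look like train_and_evaluate(automl, train_df, test_df, target="y", id_column="ID"), where train_df and test_df are dataframes that both still contain the target column.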

sort_leaderboard(metric, df=None, id_col=None, target=None, verbose=1)¶: Sorts leaderboard by given metric