AutoML class

class hana_automl.automl.AutoML(connection_context: Optional[hana_ml.dataframe.ConnectionContext] = None)
Main class. Control the whole Automated Machine Learning process here.

What is AutoML? Read here: https://www.automl.org/automl/

connection_context

Connection info to HANA database.

Type

hana_ml.dataframe.ConnectionContext

opt

Optimizer from pipeline.

model

Tuned and fitted HANA PAL model.

predicted

Dataframe containing predicted values.

preprocessor_settings

Preprocessor settings.

Type

PreprocessorSettings

property best_params

Get best hyperparameters

fit(df: Optional[Union[pandas.core.frame.DataFrame, hana_ml.dataframe.DataFrame, str]] = None, file_path: Optional[str] = None, table_name: Optional[str] = None, task: Optional[str] = None, steps: Optional[int] = None, target: Optional[str] = None, columns_to_remove: Optional[list] = None, categorical_features: Optional[list] = None, id_column: Optional[str] = None, optimizer: str = 'OptunaSearch', time_limit: Optional[int] = None, ensemble: bool = False, verbose=2, output_leaderboard: bool = False, strategy_by_col: Optional[list] = None, tuning_metric: Optional[str] = None)

Fits AutoML object

Parameters
  • df (pandas.DataFrame or hana_ml.dataframe.DataFrame or str) – Attention: pass the whole dataframe, without splitting it into X_train, y_train, etc. See Notes for extra information.

  • file_path (str) – Path/url to dataframe. Accepts .csv and .xlsx. See Notes for extra information. Examples: ‘https://website/dataframe.csv’ or ‘users/dev/file.csv’

  • table_name (str) – Name of table in HANA database. See Notes for extra information

  • task (str) – Machine Learning task. ‘reg’ (regression) and ‘cls’ (classification) are currently supported.

  • steps (int) – Number of iterations.

  • target (str) – The column we want to predict. For multiple columns pass a list. Example: [‘feature1’, ‘feature2’].

  • columns_to_remove (list) – List of columns to delete. Example: [‘column1’, ‘column2’].

  • categorical_features (list) – Categorical features are columns that generally take a limited number of possible values. For example, if the target variable contains only 1 and 0, it is considered categorical. Another example: the column ‘gender’ contains the values ‘male’, ‘female’ and ‘other’. As it (generally) can’t contain any other values, it is categorical. Details here: https://towardsdatascience.com/understanding-feature-engineering-part-2-categorical-data-f54324193e63 Example: [‘column1’, ‘column2’].

  • id_column (str) – ID column in table. Example: ‘ID’. Needed for HANA. If None, it is created in the dataset automatically.

  • optimizer (str) – Optimizer to tune hyperparameters. Currently supported: “OptunaSearch” (default), “BayesianOptimizer” (unstable)

  • time_limit (int) – Time limit (in seconds) for tuning the model.

  • ensemble (bool) – Whether to build an ensemble. Currently supported: “blending”. See the Ensembles section for details.

  • verbose (int) – Level of output. 1 - minimal, 2 - all output.

  • output_leaderboard (bool) – Print algorithms leaderboard or not.

  • strategy_by_col (ListOfTuples) – Specifies the imputation strategy for a set of columns, overriding the overall strategy for data imputation. Each tuple in the list should contain at least two elements: the 1st element is the name of a column; the 2nd element is the imputation strategy for that column (numerical: “mean”, “median”, “delete”, “als”, “numerical_const”; categorical: “categorical_const”). If the imputation strategy is “categorical_const” or “numerical_const”, a 3rd element must be included in the tuple, specifying the constant value used to replace the missing values detected in the column.
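The tuple rules for strategy_by_col can be checked before calling fit(). The following is a minimal sketch, not part of hana_automl: the helper check_strategy_by_col and the column names are hypothetical, and the strategy names are copied from the description above.

```python
from typing import List, Tuple

# Strategy names taken from the strategy_by_col description above.
NUMERICAL_STRATEGIES = {"mean", "median", "delete", "als", "numerical_const"}
CATEGORICAL_STRATEGIES = {"categorical_const"}
CONST_STRATEGIES = {"numerical_const", "categorical_const"}


def check_strategy_by_col(strategy_by_col: List[Tuple]) -> None:
    """Hypothetical helper: raise ValueError if an entry breaks the documented rules."""
    for entry in strategy_by_col:
        # Each tuple needs at least a column name and a strategy.
        if len(entry) < 2:
            raise ValueError(f"{entry!r}: need at least (column, strategy)")
        column, strategy = entry[0], entry[1]
        if strategy not in NUMERICAL_STRATEGIES | CATEGORICAL_STRATEGIES:
            raise ValueError(f"{column}: unknown strategy {strategy!r}")
        # Constant strategies require the constant as a 3rd element.
        if strategy in CONST_STRATEGIES and len(entry) < 3:
            raise ValueError(f"{column}: {strategy!r} needs a constant as 3rd element")


# Column names below are placeholders for illustration only.
strategy_by_col = [
    ("age", "median"),
    ("balance", "numerical_const", 0),
    ("education", "categorical_const", "unknown"),
]
check_strategy_by_col(strategy_by_col)  # passes silently
```

A list that passes this check has the shape described for strategy_by_col; whether the library performs the same validation internally is not stated in this reference.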

Notes

There are multiple options to load data in HANA database. Here are all available parameter combinations:

1) df: pandas.DataFrame -> dataframe will be loaded to a new table with random name, like ‘AUTOML-9082-842408-12’

2) df: pandas.DataFrame + table_name -> dataframe will be loaded to existing table

3) df: hana_ml.dataframe.DataFrame -> existing Dataframe will be used

4) table_name -> we’ll connect to existing table

5) file_path -> data from file/url will be loaded to a new table with random name, like ‘AUTOML-9082-842408-12’

6) file_path + table_name -> data from file/url will be loaded to existing table

7) df: str -> we’ll connect to existing table
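The seven combinations above can be read as a lookup table. The sketch below restates them in plain Python so a call can be sanity-checked before it reaches the database. It does not reproduce the library's internal logic: describe_load_mode and the df_kind labels ("pandas", "hana", "str") are hypothetical names used only for illustration.

```python
from typing import Optional

# Keys: (kind of df, file_path given?, table_name given?), one per Notes entry.
LOAD_MODES = {
    ("pandas", False, False): "load dataframe into a new, randomly named table",  # 1
    ("pandas", False, True): "load dataframe into the existing table",            # 2
    ("hana", False, False): "use the existing hana_ml DataFrame",                 # 3
    (None, False, True): "connect to the existing table",                         # 4
    (None, True, False): "load file/url into a new, randomly named table",        # 5
    (None, True, True): "load file/url into the existing table",                  # 6
    ("str", False, False): "connect to the existing table",                       # 7
}


def describe_load_mode(df_kind: Optional[str] = None,
                       file_path: Optional[str] = None,
                       table_name: Optional[str] = None) -> str:
    """Hypothetical helper: map an argument combination to its Notes entry."""
    key = (df_kind, file_path is not None, table_name is not None)
    try:
        return LOAD_MODES[key]
    except KeyError:
        # Only means "not listed in the Notes"; the library may behave differently.
        raise ValueError("combination not listed in the Notes") from None


print(describe_load_mode(table_name="BANK"))
print(describe_load_mode(file_path="users/dev/file.csv", table_name="BANK"))
```

Combinations outside the table (for example a hana_ml DataFrame together with file_path) are flagged here only because the Notes do not list them.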

Examples

Passing connection info:

>>> from hana_ml.dataframe import ConnectionContext
>>> cc = ConnectionContext(address='database address',
...                        user='your username',
...                        password='your password',
...                        port=9999) #your port

Creating and fitting the model (see fitting section for detailed example):

>>> from hana_automl.automl import AutoML
>>> automl = AutoML(cc)
>>> automl.fit(
...     df=df,  # your pandas dataframe
...     target="y",
...     id_column='ID',
...     categorical_features=["y", 'marital', 'education', 'housing', 'loan'],
...     columns_to_remove=['default', 'contact', 'month', 'poutcome', 'job'],
...     steps=10,
... )

get_algorithm()

Returns fitted AutoML algorithm. If ‘ensemble’ parameter is True, returns ensemble algorithm.

get_leaderboard()

Get the algorithms leaderboard

get_model()

Returns fitted HANA PAL model

model_to_file(file_path: str)

Saves model information to JSON file

Parameters

file_path (str) – Path to save

property optimizer

Get optimizer

predict(df: Optional[Union[pandas.core.frame.DataFrame, hana_ml.dataframe.DataFrame, str]] = None, file_path: Optional[str] = None, table_name: Optional[str] = None, id_column: Optional[str] = None, target_drop: Optional[str] = None, verbose=1) → pandas.core.frame.DataFrame

Makes predictions using fitted model.

Parameters
  • df (pandas.DataFrame or hana_ml.dataframe.DataFrame or str) – Attention: pass the whole dataframe, without splitting it into X_train, y_train, etc. See Notes for extra information.

  • file_path (str) – Path/url to dataframe. Accepts .csv and .xlsx. See Notes for extra information. Examples: ‘https://website/dataframe.csv’ or ‘users/dev/file.csv’

  • table_name (str) – Name of table in HANA database. See Notes for extra information

  • id_column (str) – ID column in table. Needed for HANA. If None, it is created in the dataset automatically.

  • target_drop (str) – Target column to drop, if it exists in the input data.

  • verbose (int) – Level of output. 0 - minimal, 1 - all output.

Notes

There are multiple options to load data in HANA database. Here are all available parameter combinations:

1) df: pandas.DataFrame -> dataframe will be loaded to a new table with random name, like ‘AUTOML-9082-842408-12’

2) df: pandas.DataFrame + table_name -> dataframe will be loaded to existing table

3) df: hana_ml.dataframe.DataFrame -> existing Dataframe will be used

4) table_name -> we’ll connect to existing table

5) file_path -> data from file/url will be loaded to a new table with random name, like ‘AUTOML-9082-842408-12’

6) file_path + table_name -> data from file/url will be loaded to existing table

7) df: str -> we’ll connect to existing table

Returns

Pandas dataframe with predictions.

Return type

pandas.core.frame.DataFrame

Examples

>>> automl.predict(file_path='data/predict.csv',
...                table_name='PREDICTION',
...                id_column='ID',
...                target_drop='target',
...                verbose=1)

print_leaderboard()

Output leaderboard

save_results_as_csv(file_path: str)

Saves prediction results to .csv file

Parameters

file_path (str) – Path to save

save_stats_as_csv(file_path: str)

Saves prediction statistics to .csv file

Parameters

file_path (str) – Path to save

score(df: Optional[Union[pandas.core.frame.DataFrame, hana_ml.dataframe.DataFrame, str]] = None, file_path: Optional[str] = None, table_name: Optional[str] = None, target: Optional[str] = None, id_column: Optional[str] = None, metric=None) → float

Returns model score.

Parameters
  • df (pandas.DataFrame or hana_ml.dataframe.DataFrame or str) – Attention: pass the whole dataframe, without splitting it into X_train, y_train, etc. See Notes for extra information.

  • file_path (str) – Path/url to dataframe. Accepts .csv and .xlsx. See Notes for extra information. Examples: ‘https://website/dataframe.csv’ or ‘users/dev/file.csv’

  • table_name (str) – Name of table in HANA database. See Notes for extra information

  • target (str) – Variable to predict. It will be dropped.

  • id_column (str) – ID column. If None, it’ll be generated automatically.

Notes

There are multiple options to load data in HANA database. Here are all available parameter combinations:

1) df: pandas.DataFrame -> dataframe will be loaded to a new table with random name, like ‘AUTOML-9082-842408-12’

2) df: pandas.DataFrame + table_name -> dataframe will be loaded to existing table

3) df: hana_ml.dataframe.DataFrame -> existing Dataframe will be used

4) table_name -> we’ll connect to existing table

5) file_path -> data from file/url will be loaded to a new table with random name, like ‘AUTOML-9082-842408-12’

6) file_path + table_name -> data from file/url will be loaded to existing table

7) df: str -> we’ll connect to existing table

Returns

score – Model score.

Return type

float
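As an illustration of how fit(), predict() and score() fit together, the sketch below wraps the three calls in one function. It uses only keyword arguments that appear in the signatures above. The function name fit_predict_score, the file paths, and the choice of a classification task with 10 steps are assumptions made for the example, not recommendations from the library.

```python
from typing import Any


def fit_predict_score(automl: Any, train_csv: str, test_csv: str, target: str) -> float:
    """Hypothetical wrapper: train on one file, then predict and score on another."""
    # Fit on the whole training file; 'cls' is one of the two documented tasks.
    automl.fit(file_path=train_csv, task="cls", target=target,
               id_column="ID", steps=10)
    # The test file still contains the target column, so ask predict() to drop it.
    automl.predict(file_path=test_csv, id_column="ID", target_drop=target)
    # score() receives the same file and the name of the target column.
    return automl.score(file_path=test_csv, target=target, id_column="ID")


# Usage, assuming `automl = AutoML(cc)` as in the Examples for fit()
# and placeholder file paths:
# result = fit_predict_score(automl, "data/train.csv", "data/test.csv", "y")
```

Passing the AutoML object in as an argument keeps the sketch independent of a live HANA connection, which also makes it easy to exercise with a stand-in object.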

sort_leaderboard(metric, df=None, id_col=None, target=None, verbose=1)

Sorts leaderboard by given metric