Pipeline

class hana_automl.pipeline.data.Data(train: Optional[hana_ml.dataframe.DataFrame] = None, test: Optional[hana_ml.dataframe.DataFrame] = None, valid: Optional[hana_ml.dataframe.DataFrame] = None, target: Optional[str] = None, id_col: Optional[str] = None)

Stores the parts of a dataset in one place so that they can be reused throughout the pipeline.

train

Train part of dataset

Type

DataFrame

test

Test part of dataset (30% of all data)

Type

DataFrame

valid

Validation part of dataset, used for model evaluation at the end of the process (10-15% of all data)

Type

DataFrame

id_col

ID column. Needed for HANA.

Type

str

clear(num_strategy: str = 'mean', categorical_list: Optional[list] = None, normalizer_strategy: str = 'min-max', normalizer_z_score_method: str = '', normalize_int: bool = False, strategy_by_col: Optional[list] = None, drop_outers: bool = False, normalization_excp: Optional[list] = None, clean_sets: list = ['test', 'train', 'valid'])

Cleans the data using the methods defined in the parameters.

Parameters
  • num_strategy (str) – Strategy for imputing missing numeric values. Defaults to ‘mean’.

  • dropempty (bool) – Whether to drop empty rows.

  • categorical_list (list) – List of categorical features.

  • normalizer_strategy (str) – Strategy for normalization. Defaults to ‘min-max’.

  • normalizer_z_score_method (str) – Z-score method to use. A z-score (also called a standard score) indicates how far a data point is from the mean.

  • normalize_int (bool) – Whether to normalize integer columns.

  • strategy_by_col (ListOfTuples) – Specifies the imputation strategy for a set of columns, overriding the overall imputation strategy. Each tuple in the list should contain at least two elements: the 1st element is the name of a column; the 2nd element is the imputation strategy for that column (for numerical columns: ‘mean’, ‘median’, ‘delete’, ‘als’ or ‘numerical_const’; for categorical columns: ‘categorical_const’). If the imputation strategy is ‘categorical_const’ or ‘numerical_const’, a 3rd element must be included in the tuple, specifying the constant value used to replace the missing values detected in the column.

  • clean_sets (ListOfStrings) – Specifies the parts of the dataset that will be preprocessed. The list should contain ‘test’, ‘train’ or ‘valid’. Other values will be ignored.
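The tuple shapes described for strategy_by_col, and the filtering described for clean_sets, can be illustrated with a small standalone sketch. The helper names `check_strategy_by_col` and `filter_clean_sets` and the column names are hypothetical and are not part of hana_automl; the sketch only mirrors the rules stated in the parameter descriptions and does not show how the library validates its input.

```python
from typing import List, Tuple

# Strategy names taken from the strategy_by_col description.
NUMERIC_STRATEGIES = {"mean", "median", "delete", "als", "numerical_const"}
CONST_STRATEGIES = {"categorical_const", "numerical_const"}
ALLOWED_SETS = {"test", "train", "valid"}


def check_strategy_by_col(strategy_by_col: List[Tuple]) -> None:
    """Hypothetical helper: raise ValueError if a tuple is malformed."""
    for entry in strategy_by_col:
        if len(entry) < 2:
            raise ValueError(f"{entry!r}: expected (column, strategy)")
        column, strategy = entry[0], entry[1]
        if strategy not in NUMERIC_STRATEGIES | CONST_STRATEGIES:
            raise ValueError(f"{column}: unknown strategy {strategy!r}")
        # Constant strategies need the constant itself as a 3rd element.
        if strategy in CONST_STRATEGIES and len(entry) < 3:
            raise ValueError(f"{column}: {strategy!r} needs a 3rd element")


def filter_clean_sets(clean_sets: List[str]) -> List[str]:
    """Hypothetical helper: keep only recognised dataset parts."""
    return [name for name in clean_sets if name in ALLOWED_SETS]


# Column names below are made up for illustration.
strategy_by_col = [
    ("AGE", "median"),
    ("INCOME", "numerical_const", 0),
    ("CITY", "categorical_const", "unknown"),
]
check_strategy_by_col(strategy_by_col)
kept = filter_clean_sets(["train", "holdout", "valid"])
```

A list built this way could then be passed as the strategy_by_col argument of clear().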

Returns

Data – Data with changes.

Return type

Data
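To make the two normalization families named by normalizer_strategy and normalizer_z_score_method concrete, here is a minimal plain-Python sketch of min-max scaling and z-score standardization. It is not the library's implementation, which runs inside HANA. Which z-score variant a given normalizer_z_score_method value selects is not stated in this section, so the sketch assumes the mean and the population standard deviation.

```python
from statistics import mean, pstdev
from typing import List


def min_max(values: List[float]) -> List[float]:
    """Rescale values linearly so that min -> 0.0 and max -> 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no spread; map everything to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def z_score(values: List[float]) -> List[float]:
    """Express each value as a number of standard deviations from the mean.

    Assumption: mean and population standard deviation are used.
    """
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


values = [2.0, 4.0, 6.0, 8.0]
scaled = min_max(values)
standardised = z_score(values)
```

Min-max scaling keeps the result in a fixed range, while z-scores are centred on zero with unit spread.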

drop(droplist_columns: list)

Drops columns from the table.

Parameters

droplist_columns (list) – Columns to remove.

class hana_automl.pipeline.input.Input(connection_context: Optional[hana_ml.dataframe.ConnectionContext] = None, df: Optional[Union[pandas.core.frame.DataFrame, hana_ml.dataframe.DataFrame, str]] = None, target: Optional[str] = None, path: Optional[str] = None, id_col: Optional[str] = None, table_name: Optional[str] = None, verbose: bool = True)

Handles input data. It can also be used outside the pipeline to load data into the database.

connection_context

Connection info to HANA database.

Type

hana_ml.dataframe.ConnectionContext

df

Pandas dataframe with data, or hana_ml dataframe, or string containing existing table name.

Type

pandas.DataFrame or hana_ml.dataframe.DataFrame or str

id_col

ID column for HANA table.

Type

str

file_path

Path to data file.

Type

str

target

Target variable that we want to predict.

Type

str

table_name

Table’s name in HANA database.

Type

str

hana_df

Converted HANA dataframe.

Type

hana_ml.dataframe.DataFrame

verbose

Level of output.

static download_data(path: str)

Downloads data from the given path or URL.

Parameters

path (str) – Path/url to the file.

Raises

InputError – If file format is wrong.

load_data()

Loads data to HANA database.

split_data() → hana_automl.pipeline.data.Data

Splits single dataframe into multiple dataframes and passes them to Data.

Returns

Data object holding the resulting train, test and validation parts.

Return type

Data
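The proportions documented for Data (30% test, 10-15% validation) can be illustrated with a plain-Python sketch on row indices. This is not how split_data() is implemented, since the real split operates on HANA dataframes. The function name `split_rows`, the seed, the 10% validation share and the assumption that the three parts are disjoint are all illustrative.

```python
import random
from typing import Dict, List, Sequence


def split_rows(
    rows: Sequence,
    test_share: float = 0.30,
    valid_share: float = 0.10,
    seed: int = 0,
) -> Dict[str, List]:
    """Hypothetical helper: shuffle rows and cut them into three parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # deterministic for a fixed seed
    n_test = round(len(shuffled) * test_share)
    n_valid = round(len(shuffled) * valid_share)
    return {
        "test": shuffled[:n_test],
        "valid": shuffled[n_test:n_test + n_valid],
        "train": shuffled[n_test + n_valid:],  # everything that is left
    }


parts = split_rows(range(100))
```

With 100 rows and these shares, the remaining 60% of the rows end up in the training part.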

class hana_automl.pipeline.modelres.ModelBoard(algorithm, train_score: float, preprocessor: hana_automl.preprocess.settings.PreprocessorSettings)

This class stores models that are shown in leaderboard.

class hana_automl.pipeline.pipeline.Pipeline(data: hana_automl.pipeline.data.Data, steps: int, task: str, time_limit: Optional[int] = None, verbose=2, tuning_metric=None)

The ‘director’ of the whole hyperparameter searching process.

data

Input data.

Type

Data

iter

Number of iterations.

Type

int

opt

Optimizer.

time_limit

Time limit for the optimization, in seconds.

Type

int

verbose

Level of output.

train(categorical_features: Optional[list] = None, optimizer: Optional[str] = None)

Preprocesses data and starts optimization.

Parameters
  • categorical_features (list) – List of categorical features.

  • optimizer (str) – Optimizer used to search for hyperparameters. Currently supported: “OptunaSearch” (default) and “BayesianOptimizer” (unstable).

Returns

Optimizer.

Return type

opt
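The classes in this section might be combined roughly as follows. This is a sketch based only on the signatures listed above, not a verified recipe. The wrapper `run_pipeline` is hypothetical, calling load_data() before split_data() is an assumption, and the accepted values for task are not documented in this section, so the caller has to supply one. The connection_context argument is expected to be an already opened hana_ml.dataframe.ConnectionContext.

```python
from typing import Optional


def run_pipeline(
    connection_context,
    table_name: str,
    target: str,
    id_col: str,
    task: str,
    steps: int = 10,
    time_limit: Optional[int] = None,
):
    """Hypothetical wrapper: load a table, split it and start the search."""
    # Imported lazily so this module loads even without hana_automl installed.
    from hana_automl.pipeline.input import Input
    from hana_automl.pipeline.pipeline import Pipeline

    # df may be a string containing the name of an existing table.
    inp = Input(
        connection_context=connection_context,
        df=table_name,
        target=target,
        id_col=id_col,
    )
    inp.load_data()          # assumption: must run before split_data()
    data = inp.split_data()  # returns a Data object

    pipe = Pipeline(data=data, steps=steps, task=task, time_limit=time_limit)
    # "OptunaSearch" is documented above as the default optimizer.
    return pipe.train(optimizer="OptunaSearch")
```

Nothing runs until `run_pipeline` is called with a live connection, so the definition itself has no side effects.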