Pipeline¶

class hana_automl.pipeline.data.Data(train: Optional[hana_ml.dataframe.DataFrame] = None, test: Optional[hana_ml.dataframe.DataFrame] = None, valid: Optional[hana_ml.dataframe.DataFrame] = None, target: Optional[str] = None, id_col: Optional[str] = None)¶

Stores the train, test and validation parts of a dataset in one place so that they can be reused throughout the pipeline.

train¶

Train part of dataset

Type: DataFrame

test¶

Test part of dataset (30% of all data)

Type: DataFrame

valid¶

Validation part of dataset, used for model evaluation at the end of the process (10-15% of all data)

Type: DataFrame

id_colm¶

ID column. Needed for HANA.

Type: str

clear(num_strategy: str = 'mean', categorical_list: Optional[list] = None, normalizer_strategy: str = 'min-max', normalizer_z_score_method: str = '', normalize_int: bool = False, strategy_by_col: Optional[list] = None, drop_outers: bool = False, normalization_excp: Optional[list] = None, clean_sets: list = ['test', 'train', 'valid'])¶

Cleans the data using the methods specified in the parameters.

Parameters

num_strategy (str) – Imputation strategy for numeric variables. Defaults to ‘mean’.
dropempty (bool) – Whether to drop empty rows.
categorical_list (list) – List of categorical features.
normalizer_strategy (str) – Strategy for normalization. Defaults to ‘min-max’.
normalizer_z_score_method (str) – Method used for z-score normalization. A z-score (also called a standard score) measures how far a data point is from the mean.
normalize_int (bool) – Whether to normalize integer columns.
strategy_by_col (ListOfTuples) – Specifies the imputation strategy for a set of columns, overriding the overall imputation strategy. Each tuple in the list should contain at least two elements: the first is the name of a column; the second is the imputation strategy for that column (for numerical columns: ‘mean’, ‘median’, ‘delete’, ‘als’ or ‘numerical_const’; for categorical columns: ‘categorical_const’). If the imputation strategy is ‘categorical_const’ or ‘numerical_const’, the tuple must include a third element specifying the constant value used to replace missing values in the column.
clean_sets (ListOfStrings) – Specifies the parts of the dataset to preprocess. The list may contain ‘test’, ‘train’ and ‘valid’; other values are ignored.

Returns

Data – Data with changes.

Return type

Data
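The shape rules for strategy_by_col and clean_sets described above can be checked before calling clear. The following is a minimal sketch in plain Python: validate_strategy_by_col and filter_clean_sets are hypothetical helpers written for illustration, not part of hana_automl, and the column names are made up.

```python
# Hypothetical helpers (not part of hana_automl) that mirror the rules
# documented for `strategy_by_col` and `clean_sets`.

CONST_STRATEGIES = {"categorical_const", "numerical_const"}
KNOWN_SETS = ("test", "train", "valid")


def validate_strategy_by_col(strategy_by_col):
    """Raise ValueError if a tuple breaks the documented shape rules."""
    for entry in strategy_by_col:
        if len(entry) < 2:
            raise ValueError(f"{entry!r}: need a column name and a strategy")
        column, strategy = entry[0], entry[1]
        if strategy in CONST_STRATEGIES and len(entry) < 3:
            raise ValueError(
                f"{column!r}: {strategy!r} needs a constant as third element"
            )
    return strategy_by_col


def filter_clean_sets(clean_sets):
    """Keep only the recognised dataset parts; other values are ignored."""
    return [name for name in clean_sets if name in KNOWN_SETS]


# Column names below are invented for the example.
strategies = validate_strategy_by_col([
    ("AGE", "median"),
    ("INCOME", "numerical_const", 0),
    ("CITY", "categorical_const", "unknown"),
])
parts = filter_clean_sets(["train", "valid", "holdout"])
print(parts)
```

Given the signature above, the checked values could then be passed on as data.clear(strategy_by_col=strategies, clean_sets=parts).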

drop(droplist_columns: list)¶

Drops columns from the table.

Parameters: droplist_columns (list) – Columns to remove.

class hana_automl.pipeline.input.Input(connection_context: Optional[hana_ml.dataframe.ConnectionContext] = None, df: Optional[Union[pandas.core.frame.DataFrame, hana_ml.dataframe.DataFrame, str]] = None, target: Optional[str] = None, path: Optional[str] = None, id_col: Optional[str] = None, table_name: Optional[str] = None, verbose: bool = True)¶

Handles input data. You can also use it independently of the pipeline to load data into the database.

connection_context¶

Connection info to HANA database.

Type: hana_ml.dataframe.ConnectionContext

df¶

Pandas dataframe with data, or hana_ml dataframe, or string containing existing table name.

Type: pandas.DataFrame or hana_ml.dataframe.DataFrame or str

id_col¶

ID column for HANA table.

Type: str

file_path¶

Path to data file.

Type: str

target¶

Target variable that we want to predict.

Type: str

table_name¶

Table’s name in HANA database.

Type: str

hana_df¶

Converted HANA dataframe.

Type: hana_ml.dataframe.DataFrame

verbose¶: Level of output

static download_data(path: str)¶

Downloads data from the given path or URL.

Parameters: path (str) – Path or URL to the file.
Raises: InputError – If file format is wrong.

load_data()¶: Loads data to HANA database.

split_data() → hana_automl.pipeline.data.Data ¶

Splits a single dataframe into multiple dataframes and passes them to Data.

Returns: Data with changes.
Return type: Data
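A typical use of Input, based only on the signatures listed above, might look like the sketch below. The wrapper load_and_split is a hypothetical helper, and calling load_data() before split_data() is an assumption about the intended order rather than something this reference states.

```python
def load_and_split(connection_context, df, target, id_col, table_name):
    """Load `df` into HANA and return the resulting Data object."""
    # Imported lazily so the sketch can be read without hana_automl installed.
    from hana_automl.pipeline.input import Input

    source = Input(
        connection_context=connection_context,
        df=df,
        target=target,
        id_col=id_col,
        table_name=table_name,
    )
    source.load_data()          # assumption: must run before split_data()
    return source.split_data()  # Data holding the split parts of the dataset
```

Here connection_context is expected to be an existing hana_ml.dataframe.ConnectionContext, and df may be a pandas dataframe, a hana_ml dataframe or a table name, as described for the df attribute.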

class hana_automl.pipeline.modelres.ModelBoard(algorithm, train_score: float, preprocessor: hana_automl.preprocess.settings.PreprocessorSettings)¶: This class stores models that are shown in leaderboard.

class hana_automl.pipeline.pipeline.Pipeline(data: hana_automl.pipeline.data.Data, steps: int, task: str, time_limit: Optional[int] = None, verbose=2, tuning_metric=None)¶

The ‘director’ of the whole hyperparameter search process.

data¶

Input data.

Type: Data

iter¶

Number of iterations.

Type: int

opt¶: Optimizer.

time_limit¶

Time limit, in seconds.

Type: int

verbose¶: Level of output.

train(categorical_features: Optional[list] = None, optimizer: Optional[str] = None)¶

Preprocesses data and starts optimization.

Parameters

categorical_features (list) – List of categorical features.
optimizer (str) – Optimizer used to search for hyperparameters. Currently supported: “OptunaSearch” (default), “BayesianOptimizer” (unstable)

Returns

Optimizer.

Return type

opt
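Putting the constructor and train together, a search run could be sketched as follows. The wrapper run_search is a hypothetical helper; the accepted values for task are not listed in this reference, so the sketch passes the argument through unchanged.

```python
def run_search(data, steps, task, time_limit=None, categorical_features=None):
    """Build a Pipeline and return the optimizer produced by train()."""
    # Imported lazily so the sketch can be read without hana_automl installed.
    from hana_automl.pipeline.pipeline import Pipeline

    pipeline = Pipeline(data=data, steps=steps, task=task, time_limit=time_limit)
    return pipeline.train(
        categorical_features=categorical_features,
        optimizer="OptunaSearch",  # documented default optimizer
    )
```

The data argument is a Data object, for example the one returned by Input.split_data().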