Pipeline¶
-
class
hana_automl.pipeline.data.Data(train: Optional[hana_ml.dataframe.DataFrame] = None, test: Optional[hana_ml.dataframe.DataFrame] = None, valid: Optional[hana_ml.dataframe.DataFrame] = None, target: Optional[str] = None, id_col: Optional[str] = None)¶ Stores the parts of a dataset in one place so that they can be reused throughout the pipeline.
-
train¶ Train part of dataset
- Type
DataFrame
-
test¶ Test part of dataset (30% of all data)
- Type
DataFrame
-
valid¶ Validation part of dataset, used for model evaluation at the end of the process (10-15% of all data)
- Type
DataFrame
-
id_colm¶ ID column. Needed for HANA.
- Type
str
-
clear(num_strategy: str = 'mean', categorical_list: Optional[list] = None, normalizer_strategy: str = 'min-max', normalizer_z_score_method: str = '', normalize_int: bool = False, strategy_by_col: Optional[list] = None, drop_outers: bool = False, normalization_excp: Optional[list] = None, clean_sets: list = ['test', 'train', 'valid'])¶ Cleans the data using the methods specified by the parameters.
- Parameters
num_strategy (str) – Overall strategy for imputing numeric variables. Defaults to ‘mean’.
dropempty (bool) – Whether to drop empty rows.
categorical_list (list) – List of categorical features.
normalizer_strategy (str) – Strategy for normalization. Defaults to ‘min-max’.
normalizer_z_score_method (str) – Method used for z-score normalization. A z-score (also called a standard score) indicates how far a data point is from the mean.
normalize_int (bool) – Whether to normalize integer columns.
strategy_by_col (ListOfTuples) – Specifies the imputation strategy for a set of columns, overriding the overall imputation strategy. Each tuple in the list should contain at least two elements: the 1st element is the column name, and the 2nd element is the imputation strategy for that column (for numerical columns: ‘mean’, ‘median’, ‘delete’, ‘als’, or ‘numerical_const’; for categorical columns: ‘categorical_const’). If the imputation strategy is ‘categorical_const’ or ‘numerical_const’, the tuple must include a 3rd element specifying the constant value used to replace missing values in the column.
clean_sets (ListOfStrings) – Specifies which parts of the dataset will be preprocessed. The list may contain ‘test’, ‘train’, or ‘valid’; other values are ignored.
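The tuple format described for strategy_by_col can be illustrated with a small check. This is only a sketch of the rules stated above: the helper name check_strategy_by_col and the column names are hypothetical and are not part of hana_automl, whose own validation may differ.

```python
# Hypothetical helper, not part of hana_automl. It only mirrors the rules
# stated in the strategy_by_col description.
KNOWN_STRATEGIES = {
    "mean", "median", "delete", "als", "numerical_const", "categorical_const",
}
CONST_STRATEGIES = {"categorical_const", "numerical_const"}


def check_strategy_by_col(strategy_by_col):
    """Return a list of readable problems; the list is empty if all tuples look valid."""
    problems = []
    for item in strategy_by_col:
        if len(item) < 2:
            problems.append(f"{item!r}: needs a column name and a strategy")
            continue
        column, strategy = item[0], item[1]
        if strategy not in KNOWN_STRATEGIES:
            problems.append(f"{column}: unknown strategy {strategy!r}")
        elif strategy in CONST_STRATEGIES and len(item) < 3:
            # Constant strategies need the constant as a 3rd element.
            problems.append(f"{column}: {strategy!r} needs a constant value")
    return problems


# Column names are made up for the example.
example = [
    ("AGE", "median"),
    ("CITY", "categorical_const", "unknown"),
    ("INCOME", "numerical_const"),  # constant is missing, so this is reported
]
problems = check_strategy_by_col(example)
print(problems)
```

Running a check like this before calling clear() makes a malformed tuple visible early, instead of surfacing later as a database-side error.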
- Returns
Data – Data with changes.
- Return type
Data
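For readers unfamiliar with the two normalization ideas named by normalizer_strategy and normalizer_z_score_method, the following plain-Python sketch shows what min-max scaling and a z-score compute. It is not the code that runs in HANA. Using the population standard deviation is an assumption made for the example, and the function names are illustrative.

```python
from statistics import mean, pstdev


def min_max(values):
    """Rescale values linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no spread; map everything to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def z_score(values):
    """Express each value as its distance from the mean in standard deviations."""
    mu, sigma = mean(values), pstdev(values)  # population std is an assumption
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


data = [2.0, 4.0, 6.0, 8.0]
scaled = min_max(data)
standardized = z_score(data)
print(scaled)
print(standardized)
```

Min-max scaling bounds the result to a fixed range, while z-scores are centered on zero and unbounded, which makes them less sensitive to a single extreme minimum or maximum.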
-
drop(droplist_columns: list)¶ Drops the given columns from the table.
- Parameters
droplist_columns (list) – Columns to remove.
-
-
class
hana_automl.pipeline.input.Input(connection_context: Optional[hana_ml.dataframe.ConnectionContext] = None, df: Optional[Union[pandas.core.frame.DataFrame, hana_ml.dataframe.DataFrame, str]] = None, target: Optional[str] = None, path: Optional[str] = None, id_col: Optional[str] = None, table_name: Optional[str] = None, verbose: bool = True)¶ Handles input data. You can also use it outside the pipeline to load data into the database.
-
connection_context¶ Connection info to HANA database.
- Type
hana_ml.dataframe.ConnectionContext
-
df¶ Pandas dataframe with data, or hana_ml dataframe, or string containing existing table name.
- Type
pandas.DataFrame or hana_ml.dataframe.DataFrame or str
-
id_col¶ ID column for HANA table.
- Type
str
-
file_path¶ Path to data file.
- Type
str
-
target¶ Target variable that we want to predict.
- Type
str
-
table_name¶ Table’s name in HANA database.
- Type
str
-
hana_df¶ Converted HANA dataframe.
- Type
hana_ml.dataframe.DataFrame
-
verbose¶ Level of output.
-
static
download_data(path: str)¶ Downloads data from the given path or URL.
- Parameters
path (str) – Path or URL to the file.
- Raises
InputError – If file format is wrong.
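The format check implied by the Raises entry can be sketched as follows. The InputError class here is a local stand-in for the library's exception, the helper name check_file_format is hypothetical, and the set of accepted extensions is an assumption chosen for illustration rather than the documented list.

```python
import os
from urllib.parse import urlparse


class InputError(Exception):
    """Local stand-in for the library's InputError."""


# Assumption: illustrative set of accepted extensions, not the documented one.
SUPPORTED_EXTENSIONS = {".csv", ".xlsx"}


def check_file_format(path):
    """Return the lowercase extension of a path or URL, or raise InputError."""
    # urlparse also accepts plain local paths; query strings are dropped
    # so that "file.csv?download=1" is still recognized as a CSV file.
    extension = os.path.splitext(urlparse(path).path)[1].lower()
    if extension not in SUPPORTED_EXTENSIONS:
        raise InputError(f"Unsupported file format: {extension or 'none'}")
    return extension


print(check_file_format("https://example.com/files/train.CSV?download=1"))
print(check_file_format("data/train.xlsx"))
```

Checking the extension before downloading avoids fetching a large file only to reject it afterwards.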
-
load_data()¶ Loads data into the HANA database.
-
split_data() → hana_automl.pipeline.data.Data¶ Splits a single dataframe into multiple dataframes and passes them to Data.
- Returns
Data object containing the split dataframes.
- Return type
hana_automl.pipeline.data.Data
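The proportions quoted for the Data attributes (about 30% of all rows for test and 10-15% for validation) can be sketched on a list of row IDs. This is a plain-Python illustration, not the library's splitting logic, which operates on HANA dataframes. The function name split_rows, the shuffling, and the fixed seed are assumptions made for the example.

```python
import random


def split_rows(row_ids, test_share=0.3, valid_share=0.1, seed=0):
    """Shuffle row IDs and cut them into train, test, and validation parts."""
    ids = list(row_ids)
    random.Random(seed).shuffle(ids)  # fixed seed keeps the split reproducible
    n_test = round(len(ids) * test_share)
    n_valid = round(len(ids) * valid_share)
    test = ids[:n_test]
    valid = ids[n_test:n_test + n_valid]
    train = ids[n_test + n_valid:]  # everything that is left
    return train, test, valid


train, test, valid = split_rows(range(100))
print(len(train), len(test), len(valid))
```

Both shares are taken from the full dataset, matching the "of all data" wording above, so the training part is whatever remains after the other two are cut.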
-
-
class
hana_automl.pipeline.modelres.ModelBoard(algorithm, train_score: float, preprocessor: hana_automl.preprocess.settings.PreprocessorSettings)¶ This class stores models that are shown in the leaderboard.
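To show how entries of this shape could feed a leaderboard, here is a minimal stand-in. BoardEntry is a hypothetical dataclass that only copies the three constructor fields, and the algorithm names, scores, and settings are invented placeholders. Sorting by descending train_score assumes that a higher score is better, which depends on the metric in use.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class BoardEntry:
    """Hypothetical stand-in with the same three fields as ModelBoard."""
    algorithm: Any
    train_score: float
    preprocessor: Any


# Placeholder values, invented for the example.
entries = [
    BoardEntry("algorithm_a", 0.81, {"num_strategy": "mean"}),
    BoardEntry("algorithm_b", 0.93, {"num_strategy": "median"}),
    BoardEntry("algorithm_c", 0.88, {"num_strategy": "mean"}),
]

# Assumption: higher train_score is better.
leaderboard = sorted(entries, key=lambda e: e.train_score, reverse=True)
for rank, entry in enumerate(leaderboard, start=1):
    print(rank, entry.algorithm, entry.train_score)
```

Keeping the preprocessor settings next to each score means the full configuration behind a leaderboard row can be reproduced later.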
-
class
hana_automl.pipeline.pipeline.Pipeline(data: hana_automl.pipeline.data.Data, steps: int, task: str, time_limit: Optional[int] = None, verbose=2, tuning_metric=None)¶ The ‘director’ of the whole hyperparameter search process.
-
iter¶ Number of iterations.
- Type
int
-
opt¶ Optimizer.
-
time_limit¶ Time limit, in seconds.
- Type
int
-
verbose¶ Level of output.
-
train(categorical_features: Optional[list] = None, optimizer: Optional[str] = None)¶ Preprocesses data and starts optimization.
- Parameters
categorical_features (list) – List of categorical features.
optimizer (str) – Optimizer used to search for hyperparameters. Currently supported: “OptunaSearch” (default), “BayesianOptimizer” (unstable).
- Returns
Optimizer.
- Return type
opt
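The interplay of an iteration budget and a time limit can be sketched with a toy search loop. This uses plain random search rather than Optuna or Bayesian optimization. Treating steps as a maximum number of iterations and time_limit as an early-stop threshold is an assumption about how the two settings combine. The names search and toy_objective are hypothetical.

```python
import random
import time


def search(objective, steps, time_limit=None, seed=0):
    """Try up to `steps` random candidates, stopping early once `time_limit` seconds pass."""
    rng = random.Random(seed)
    started = time.monotonic()
    best_params, best_score = None, float("-inf")
    iterations = 0
    for _ in range(steps):
        # Check the clock before each trial so a trial is never started late.
        if time_limit is not None and time.monotonic() - started >= time_limit:
            break
        params = {"x": rng.uniform(-5.0, 5.0)}
        score = objective(params)
        if score > best_score:
            best_params, best_score = params, score
        iterations += 1
    return best_params, best_score, iterations


def toy_objective(params):
    """Toy score that peaks at x = 1."""
    return -(params["x"] - 1.0) ** 2


best_params, best_score, iterations = search(toy_objective, steps=50, time_limit=5)
print(best_params, best_score, iterations)
```

With both limits set, whichever is reached first ends the search, so a generous iteration count can be paired with a strict time limit when run time matters more than coverage.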
-