pyspark_ds_toolbox.ml.data_prep package

Submodules

pyspark_ds_toolbox.ml.data_prep.class_weights module

Module dedicated to class weighting tools.

pyspark_ds_toolbox.ml.data_prep.class_weights.binary_classifier_weights(dfs: pyspark.sql.dataframe.DataFrame, col_target: str) → pyspark.sql.dataframe.DataFrame

Adds a class weight column for a binary classification response column.

Parameters

dfs (pyspark.sql.dataframe.DataFrame) – Training dataset with the col_target column.
col_target (str) – Name of the column that contains the model's response variable. It should contain only the values 0 and 1.

Raises

ValueError – If the unique values in the col_target column are not 0 and 1.

Returns

The dfs object with an added weight_{col_target} column.

Return type

pyspark.sql.dataframe.DataFrame
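
The weighting formula itself is not stated above. As an illustration only, the sketch below applies the common "balanced" heuristic (total rows divided by twice the class count) and mirrors the documented ValueError check. It is written in plain Python so it runs without Spark; the function name balanced_binary_weights is hypothetical and this is not the library's implementation.

```python
from collections import Counter
from typing import Dict, Iterable


def balanced_binary_weights(labels: Iterable[int]) -> Dict[int, float]:
    """Hypothetical helper: one weight per class, using n_total / (2 * n_class)."""
    counts = Counter(labels)
    # Mirror the documented contract: only the values 0 and 1 are accepted.
    if set(counts) != {0, 1}:
        raise ValueError("target must contain only the values 0 and 1")
    n_total = sum(counts.values())
    # The rarer class receives the larger weight.
    return {label: n_total / (2 * n) for label, n in counts.items()}


# Three negatives and one positive: the positive class is up-weighted.
weights = balanced_binary_weights([0, 0, 0, 1])
```

In a Spark setting, such a mapping would typically be joined back onto the DataFrame as the weight_{col_target} column and passed to an estimator's weight column parameter.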

pyspark_ds_toolbox.ml.data_prep.features_vector module

Module dedicated to tools for building Spark feature vectors.

pyspark_ds_toolbox.ml.data_prep.features_vector.get_features_vector(num_features: Optional[List[str]] = None, cat_features: Optional[List[str]] = None, output_col='features') → List

Assembles a features vector to be used with ML algorithms.

Parameters

num_features (Optional[List[str]]) – List of column names of numeric features. Defaults to None.
cat_features (Optional[List[str]]) – List of column names of categorical features (StringIndexer). Defaults to None.
output_col (str) – Name of the output column. Defaults to 'features'.

Raises

TypeError – If num_features and cat_features are both None.

Returns

A list of PySpark indexers, encoders, and assemblers.

Return type

List
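
To make the description above concrete, here is a minimal sketch of how such a list of stages could be assembled with pyspark.ml. It assumes Spark 3.x (where OneHotEncoder accepts inputCols and outputCols); the function name sketch_features_stages and the _indexed / _encoded column suffixes are hypothetical, and the library's actual implementation may differ.

```python
from typing import List, Optional


def sketch_features_stages(
    num_features: Optional[List[str]] = None,
    cat_features: Optional[List[str]] = None,
    output_col: str = "features",
) -> List:
    """Hypothetical sketch: indexers, an encoder and an assembler as one list."""
    # Mirror the documented check: at least one feature list is required.
    if num_features is None and cat_features is None:
        raise TypeError("num_features and cat_features cannot both be None")

    # Imported lazily so the argument check above runs without Spark installed.
    from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorAssembler

    num_features = num_features or []
    cat_features = cat_features or []

    # Assumed naming scheme for the intermediate columns.
    indexed = [f"{c}_indexed" for c in cat_features]
    encoded = [f"{c}_encoded" for c in cat_features]

    # One StringIndexer per categorical column, then a single encoder.
    indexers = [
        StringIndexer(inputCol=c, outputCol=i)
        for c, i in zip(cat_features, indexed)
    ]
    encoders = (
        [OneHotEncoder(inputCols=indexed, outputCols=encoded)]
        if cat_features
        else []
    )
    # Numeric columns and encoded categoricals end up in one vector column.
    assembler = VectorAssembler(
        inputCols=num_features + encoded, outputCol=output_col
    )
    return indexers + encoders + [assembler]
```

Building the stages requires an active SparkSession. A list of this shape can be passed to pyspark.ml.Pipeline(stages=...), which is then fit on the training DataFrame to produce the output_col vector.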

Module contents

Sub-package dedicated to data preparation tools for machine learning.