pyspark_ds_toolbox.ml.data_prep package
Submodules
pyspark_ds_toolbox.ml.data_prep.class_weights module
Module dedicated to functionalities related to class weighting tools.
- pyspark_ds_toolbox.ml.data_prep.class_weights.binary_classifier_weights(dfs: pyspark.sql.dataframe.DataFrame, col_target: str) → pyspark.sql.dataframe.DataFrame
Adds a class weight column for a binary classification response column.
- Parameters
dfs (pyspark.sql.dataframe.DataFrame) – Training dataset with the col_target column.
col_target (str) – Name of the column that contains the response variable for the model. It should contain only the values 0 and 1.
- Raises
ValueError – If the unique values of the col_target column are not 0 and 1.
- Returns
The dfs object with a weight_{col_target} column.
- Return type
pyspark.sql.dataframe.DataFrame
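The reference above does not state how the weights are computed. As a minimal sketch, assuming the common "balanced" scheme in which each class gets the weight n / (2 * n_class), the logic could look like the following. The helper names `balanced_binary_weights` and `add_weight_column` are hypothetical and are not part of pyspark_ds_toolbox; the Spark part is imported lazily and needs pyspark installed.

```python
def balanced_binary_weights(counts):
    """Compute per-class weights from a {label: row_count} mapping.

    Assumption: balanced weighting, weight = n / (2 * n_class).
    The library's actual formula may differ.
    """
    if set(counts) != {0, 1}:
        raise ValueError("the target must contain exactly the values 0 and 1")
    n = counts[0] + counts[1]
    return {label: n / (2 * counts[label]) for label in (0, 1)}


def add_weight_column(dfs, col_target):
    """Hypothetical Spark counterpart: add a weight_{col_target} column."""
    # Imported lazily so the pure-Python helper works without pyspark.
    from pyspark.sql import functions as F

    # Count the rows per class on the cluster, then collect the two counts.
    rows = dfs.groupBy(col_target).count().collect()
    counts = {row[col_target]: row["count"] for row in rows}
    weights = balanced_binary_weights(counts)

    # Attach the weight that matches each row's class.
    return dfs.withColumn(
        f"weight_{col_target}",
        F.when(F.col(col_target) == 0, weights[0]).otherwise(weights[1]),
    )
```

Under this assumption, a training set with 3 rows of class 0 and 1 row of class 1 gives the minority class a weight three times that of the majority class, so both classes contribute equally to a weighted loss.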
pyspark_ds_toolbox.ml.data_prep.features_vector module
Module dedicated to Spark feature vector tools.
- pyspark_ds_toolbox.ml.data_prep.features_vector.get_features_vector(num_features: Optional[List[str]] = None, cat_features: Optional[List[str]] = None, output_col='features') → List
Assembles a features vector to be used with ML algorithms.
- Parameters
num_features (List[str]) – List of column names of numeric features;
cat_features (List[str]) – List of column names of categorical features (StringIndexer);
output_col (str) – name of the output column;
- Raises
TypeError – If num_features AND cat_features are of type None.
- Returns
PySpark indexers, encoders and assemblers, as a list;
- Return type
[List]
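To illustrate what a list of indexers, encoders and assemblers might contain, here is a minimal sketch built on the standard pyspark.ml.feature stages. It is not the library's implementation: the helper names `plan_columns` and `build_stages` and the `_indexed` / `_encoded` column suffixes are assumptions made for the example, and the multi-column `OneHotEncoder` signature assumes Spark 3.x.

```python
from typing import List, Optional, Tuple


def plan_columns(
    num_features: Optional[List[str]] = None,
    cat_features: Optional[List[str]] = None,
) -> Tuple[List[str], List[str], List[str]]:
    """Return (indexed, encoded, assembler_inputs) column names."""
    if num_features is None and cat_features is None:
        raise TypeError("num_features and cat_features cannot both be None")
    num = list(num_features or [])
    cat = list(cat_features or [])
    indexed = [f"{c}_indexed" for c in cat]  # hypothetical suffix
    encoded = [f"{c}_encoded" for c in cat]  # hypothetical suffix
    # Numeric columns go in as-is; categorical ones go in one-hot encoded.
    return indexed, encoded, num + encoded


def build_stages(num_features=None, cat_features=None, output_col="features"):
    """Build the stages as a list: indexers, then encoder, then assembler."""
    # Imported lazily: needs pyspark and an active SparkSession.
    from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorAssembler

    indexed, encoded, inputs = plan_columns(num_features, cat_features)
    stages = [
        StringIndexer(inputCol=c, outputCol=i)
        for c, i in zip(cat_features or [], indexed)
    ]
    if encoded:
        stages.append(OneHotEncoder(inputCols=indexed, outputCols=encoded))
    stages.append(VectorAssembler(inputCols=inputs, outputCol=output_col))
    return stages
```

A list of stages like this can be passed to `pyspark.ml.Pipeline(stages=...)`, optionally followed by an estimator, and then fitted on the training DataFrame.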
Module contents
Sub-package dedicated to data preparation tools for machine learning.