pyspark_ds_toolbox.ml.feature_selection package

Submodules

pyspark_ds_toolbox.ml.feature_selection.information_value module

class pyspark_ds_toolbox.ml.feature_selection.information_value.WeightOfEvidenceComputer(inputCol=None, inputCols=None, col_target=None)

Bases: pyspark.ml.base.Transformer, pyspark.ml.param.shared.HasInputCol, pyspark.ml.param.shared.HasOutputCol, pyspark.ml.param.shared.HasInputCols, pyspark.ml.param.shared.HasOutputCols, pyspark.ml.util.DefaultParamsReadable, pyspark.ml.util.DefaultParamsWritable

A transformer that adds the weight of evidence value for categorical variables.

This class adds columns with the Weight of Evidence for categorical features. See http://www.m-hikari.com/ams/ams-2014/ams-65-68-2014/zengAMS65-68-2014.pdf for a technical description.
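For reference, the standard definitions are as follows (this is one common sign convention; see the paper linked above for details):

    \mathrm{WOE}_i = \ln\frac{n_{1i}/n_1}{n_{0i}/n_0},
    \qquad
    \mathrm{IV} = \sum_i \left( \frac{n_{1i}}{n_1} - \frac{n_{0i}}{n_0} \right) \mathrm{WOE}_i

where n_{1i} and n_{0i} count the rows with target 1 and 0 within level i of the feature, and n_1 and n_0 are the corresponding totals.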

add_woe(dfs: pyspark.sql.dataframe.DataFrame, feature: str, col_target: str) pyspark.sql.dataframe.DataFrame

Function that adds the WOE to the dataset passed to the transform method of the class.

Parameters
  • dfs (DataFrame) – A spark DataFrame with feature and target columns.

  • feature (str) – Column name of a categorical feature, must be of type string.

  • col_target (str) – Column name of a target variable, must be of type integer with values 0 and 1.

Returns

The dfs argument with an added f’{feature}_woe’ column.

Return type

DataFrame

checkParams()
col_target = Param(parent='undefined', name='col_target', doc='Column name of the target. Must be an integer with values 0 or 1.')
getTarget()
setInputCol(new_inputCol)
setInputCols(new_inputCols)
setParams(inputCol=None, inputCols=None, col_target=None)
setTarget(new_target)
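
A minimal usage sketch (the toy data and column names are assumptions made for illustration; the constructor arguments are those documented above):

    from pyspark.sql import SparkSession, functions as F, types as T
    from pyspark_ds_toolbox.ml.feature_selection.information_value import WeightOfEvidenceComputer

    spark = SparkSession.builder.getOrCreate()

    # Toy data; the target is cast to IntegerType, as required.
    df = spark.createDataFrame(
        [("a", 1), ("a", 0), ("b", 1), ("b", 1), ("c", 0)],
        ["category", "target"],
    ).withColumn("target", F.col("target").cast(T.IntegerType()))

    # Adds a 'category_woe' column with the weight of evidence of each level.
    woe = WeightOfEvidenceComputer(inputCols=["category"], col_target="target")
    woe.transform(df).show()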
pyspark_ds_toolbox.ml.feature_selection.information_value.compute_woe_iv(dfs: pyspark.sql.dataframe.DataFrame, col_feature: str, col_target: str) Tuple[pyspark.sql.dataframe.DataFrame, float]

Function that, given a DataFrame, a categorical feature and a binary target column, computes the Information Value.

See http://www.m-hikari.com/ams/ams-2014/ams-65-68-2014/zengAMS65-68-2014.pdf for a technical description.

Parameters
  • dfs (DataFrame) – A spark DataFrame with col_feature and col_target.

  • col_feature (str) – Column name of a categorical feature.

  • col_target (str) – Column name of a binary target column. Must be of integer type and have only values 0 and 1.

Raises
  • TypeError – if dfs.schema[col_target].dataType != T.IntegerType

  • ValueError – if unique_target_values != [0, 1]

Returns

A two-element tuple with the following objects:
  • Spark DataFrame with ‘feature’, ‘feature_value’, ‘woe’ and ‘iv’ columns;

  • float with the col_feature information value.

Return type

Tuple[DataFrame, float]
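
A minimal calling sketch (the toy data are assumptions made for illustration):

    from pyspark.sql import SparkSession, functions as F, types as T
    from pyspark_ds_toolbox.ml.feature_selection.information_value import compute_woe_iv

    spark = SparkSession.builder.getOrCreate()

    # Toy data; the target is cast to IntegerType to avoid the TypeError above.
    df = spark.createDataFrame(
        [("a", 1), ("a", 0), ("b", 1), ("b", 1), ("c", 0)],
        ["category", "target"],
    ).withColumn("target", F.col("target").cast(T.IntegerType()))

    dfs_woe, iv = compute_woe_iv(dfs=df, col_feature="category", col_target="target")
    dfs_woe.show()  # feature, feature_value, woe, iv
    print(iv)       # information value of 'category'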

pyspark_ds_toolbox.ml.feature_selection.information_value.feature_selection_with_iv(dfs: pyspark.sql.dataframe.DataFrame, col_target: str, num_features: Optional[List[str]], cat_features: Optional[List[str]], floor_iv: float = 0.3, bucket_fraction: float = 0.1, categorical_as_woe: Optional[bool] = False) dict

Function that executes a feature selection based on the information value methodology.

This function computes the information value for the features passed and, based on floor_iv, selects the features that should be used in the modeling step. It returns analytical DataFrames (with the Weight of Evidence and Information Value).

Its main advantage is that it also returns a list of stages to be passed to a pyspark.ml.Pipeline that will encode and assemble the selected variables.

See http://www.m-hikari.com/ams/ams-2014/ams-65-68-2014/zengAMS65-68-2014.pdf for a technical description and https://www.listendata.com/2015/03/weight-of-evidence-woe-and-information.html for an introduction to the subject.

Parameters
  • dfs (DataFrame) – A spark DataFrame.

  • col_target (str) – Column name of a binary target column. Must be of integer type and have only values 0 and 1.

  • num_features (Union[List[str], None]) – List of column names of numeric features. These columns will be discretized with QuantileDiscretizer.

  • cat_features (Union[List[str], None]) – List of column names of categorical features.

  • floor_iv (float, optional) – Threshold for a feature to be selected; features with an information value greater than or equal to this value are kept. Defaults to 0.3.

  • bucket_fraction (float, optional) – Fraction of the dataset to be used to create the buckets. Must be between 0.05 and 0.5. Defaults to 0.1.

  • categorical_as_woe (bool, optional) – If True, categorical variables will be encoded as Weight of Evidence. If False, one-hot encoding is used. Defaults to False.

Raises
  • TypeError – if (num_features is None) and (cat_features is None)

  • ValueError – if (bucket_fraction < 0.05) or (bucket_fraction > 0.5)

Returns

A dict with the following structure
  • dfs_woe: Spark DataFrame with WOE and IV for each feature value. Given by the cols: feature, feature_value, woe, iv.

  • dfs_iv: Spark DataFrame with IV for each feature. Given by the cols: feature, iv.

  • stages_features_vector: List with spark transformers that computes the features vector based on the floor_iv and categorical_as_woe params.

Return type

dict
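
A minimal end-to-end sketch (the toy data are assumptions made for illustration; the dict keys are those documented above):

    from pyspark.ml import Pipeline
    from pyspark.sql import SparkSession, functions as F, types as T
    from pyspark_ds_toolbox.ml.feature_selection.information_value import feature_selection_with_iv

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [("a", 25.0, 1), ("a", 32.0, 0), ("b", 47.0, 1), ("b", 51.0, 1), ("c", 38.0, 0)],
        ["category", "age", "target"],
    ).withColumn("target", F.col("target").cast(T.IntegerType()))

    selection = feature_selection_with_iv(
        dfs=df,
        col_target="target",
        num_features=["age"],
        cat_features=["category"],
        floor_iv=0.3,
        bucket_fraction=0.1,
        categorical_as_woe=False,
    )
    selection["dfs_iv"].show()  # information value per feature

    # The returned stages encode and assemble the selected features into a vector.
    pipeline = Pipeline(stages=selection["stages_features_vector"])
    df_features = pipeline.fit(df).transform(df)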

Module contents

Subpackage dedicated to Feature Selection tools.