pyspark_ds_toolbox.ml.feature_selection package
Submodules
pyspark_ds_toolbox.ml.feature_selection.information_value module
- class pyspark_ds_toolbox.ml.feature_selection.information_value.WeightOfEvidenceComputer(inputCol=None, inputCols=None, col_target=None)
Bases: pyspark.ml.base.Transformer, pyspark.ml.param.shared.HasInputCol, pyspark.ml.param.shared.HasOutputCol, pyspark.ml.param.shared.HasInputCols, pyspark.ml.param.shared.HasOutputCols, pyspark.ml.util.DefaultParamsReadable, pyspark.ml.util.DefaultParamsWritable
A transformer that adds the weight of evidence value for categorical variables.
This class adds columns with the Weight of Evidence for categorical features. See http://www.m-hikari.com/ams/ams-2014/ams-65-68-2014/zengAMS65-68-2014.pdf for a technical description.
- add_woe(dfs: pyspark.sql.dataframe.DataFrame, feature: str, col_target: str) pyspark.sql.dataframe.DataFrame
Function that adds the WOE to the dataset passed to the transform method of the class.
- Parameters
dfs (DataFrame) – A spark DataFrame with feature and target columns.
feature (str) – Column name of a categorical feature, must be of type string.
col_target (str) – Column name of a target variable, must be of type integer with values 0 and 1.
- Returns
The dfs argument with an added f’{feature}_woe’ column.
- Return type
DataFrame
- checkParams()
- col_target = Param(parent='undefined', name='col_target', doc='Column name of the target. Must be of integer type with values 0 or 1.')
- getTarget()
- setInputCol(new_inputCol)
- setInputCols(new_inputCols)
- setParams(inputCol=None, inputCols=None, col_target=None)
- setTarget(new_target)
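The column the transformer adds can be sketched in plain Python, without Spark. The helper name add_woe_rows and the sign convention WOE = ln(events share / non-events share) are illustrative assumptions, not the toolbox's actual code, which may also differ in details such as handling categories with zero counts:

```python
from collections import Counter
from math import log

def add_woe_rows(rows, feature, col_target):
    """Append an f'{feature}_woe' key to each row dict.

    Assumed convention: WOE_i = ln((events_i / events) / (non_events_i / non_events)),
    where an 'event' is a row with target == 1. Categories with zero events or
    zero non-events would need smoothing in a real implementation.
    """
    events, non_events = Counter(), Counter()  # per-category counts
    for r in rows:
        (events if r[col_target] == 1 else non_events)[r[feature]] += 1
    tot_e, tot_ne = sum(events.values()), sum(non_events.values())
    woe = {
        cat: log((events[cat] / tot_e) / (non_events[cat] / tot_ne))
        for cat in set(events) | set(non_events)
    }
    # return new rows with the WOE column appended, originals untouched
    return [{**r, f"{feature}_woe": woe[r[feature]]} for r in rows]
```

For example, a category where events are twice as frequent as non-events (relative to the overall split) gets WOE = ln 2, and the symmetric category gets −ln 2.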
- pyspark_ds_toolbox.ml.feature_selection.information_value.compute_woe_iv(dfs: pyspark.sql.dataframe.DataFrame, col_feature: str, col_target: str) Tuple[pyspark.sql.dataframe.DataFrame, float]
Function that, given a DataFrame, a categorical feature and a binary target column, computes the Information Value.
See http://www.m-hikari.com/ams/ams-2014/ams-65-68-2014/zengAMS65-68-2014.pdf for a technical description.
- Parameters
dfs (DataFrame) – A spark DataFrame with col_feature and col_target;
col_feature (str) – Column name of a categorical feature.
col_target (str) – Column name of a binary target column. Must be of integer type and contain only the values 0 and 1.
- Raises
TypeError – if dfs.schema[col_target].dataType != T.IntegerType
ValueError – unique_target_values != [0, 1]
- Returns
- A two-element tuple with the following objects:
Spark DataFrame with ‘feature’, ‘feature_value’, ‘woe’ and ‘iv’ columns;
float with the col_feature information value.
- Return type
Tuple[DataFrame, float]
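The WOE/IV arithmetic behind this function can be illustrated from per-category counts in plain Python. The function name woe_iv_from_counts and the conventions below (WOE_i = ln(pe_i / pne_i), IV = Σ (pe_i − pne_i) · WOE_i) are assumptions for illustration, not the toolbox's implementation:

```python
from math import log

def woe_iv_from_counts(counts):
    """counts: {category: (n_events, n_non_events)} for a single feature.

    Returns ({category: woe}, iv) where, writing pe_i and pne_i for the
    category's share of events and non-events respectively:
        WOE_i = ln(pe_i / pne_i)
        IV    = sum_i (pe_i - pne_i) * WOE_i
    Zero counts would need smoothing in a real implementation.
    """
    tot_e = sum(e for e, _ in counts.values())
    tot_ne = sum(ne for _, ne in counts.values())
    woe, iv = {}, 0.0
    for cat, (e, ne) in counts.items():
        pe, pne = e / tot_e, ne / tot_ne
        woe[cat] = log(pe / pne)
        iv += (pe - pne) * woe[cat]
    return woe, iv
```

Note that each IV term is non-negative ((pe − pne) and ln(pe/pne) always share a sign), so IV only accumulates as categories separate the classes more strongly.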
- pyspark_ds_toolbox.ml.feature_selection.information_value.feature_selection_with_iv(dfs: pyspark.sql.dataframe.DataFrame, col_target: str, num_features: Optional[List[str]], cat_features: Optional[List[str]], floor_iv: float = 0.3, bucket_fraction: float = 0.1, categorical_as_woe: Optional[bool] = False) dict
Function that executes a feature selection based on the information value methodology.
This function computes the information value for the features passed and, based on floor_iv, selects the features that should be used in the modeling step. It returns analytical DataFrames (with the Weight of Evidence and Information Value).
Its main advantage is that it also returns a list of stages to be passed to a pyspark.ml.Pipeline that will encode and assemble the selected variables.
See http://www.m-hikari.com/ams/ams-2014/ams-65-68-2014/zengAMS65-68-2014.pdf for a technical description and https://www.listendata.com/2015/03/weight-of-evidence-woe-and-information.html for an introduction to the subject.
- Parameters
dfs (DataFrame) – A spark DataFrame.
col_target (str) – Column name of a binary target column. Must be of integer type and contain only the values 0 and 1.
num_features (Union[List[str], None]) – List of column names of numeric features. These columns will be discretized with QuantileDiscretizer.
cat_features (Union[List[str], None]) – List of column names of categorical features.
floor_iv (float, optional) – Threshold for a feature to be selected; its information value must be greater than or equal to this value. Defaults to 0.3.
bucket_fraction (float, optional) – Fraction of the dataset to be used to create the buckets. Must be between 0.05 and 0.5. Defaults to 0.1.
categorical_as_woe (bool, optional) – If True, categorical variables are encoded as Weight of Evidence. If False, one-hot encoding is used. Defaults to False.
- Raises
TypeError – if (num_features is None) and (cat_features is None)
ValueError – if (bucket_fraction < 0.05) or (bucket_fraction > 0.5)
- Returns
- A dict with the following structure:
dfs_woe: Spark DataFrame with WOE and IV for each feature value. Given by the cols: feature, feature_value, woe, iv.
dfs_iv: Spark DataFrame with IV for each feature. Given by the cols: feature, iv.
stages_features_vector: List of spark transformers that compute the features vector based on the floor_iv and categorical_as_woe params.
- Return type
dict
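The selection rule described above (keep features whose IV is greater than or equal to floor_iv) can be sketched without Spark; select_features is a hypothetical helper for illustration, not part of the toolbox:

```python
def select_features(iv_by_feature, floor_iv=0.3):
    """Keep the features whose information value meets the floor.

    iv_by_feature: {feature_name: iv} as would be read off dfs_iv.
    Selection is inclusive (>= floor_iv), mirroring the floor_iv
    description above; output is sorted for determinism.
    """
    return sorted(f for f, iv in iv_by_feature.items() if iv >= floor_iv)
```

A commonly cited rule of thumb (see the listendata introduction linked above) treats IV below 0.02 as not predictive, 0.02–0.1 as weak, 0.1–0.3 as medium, and above 0.3 as strong, which is consistent with the default floor_iv of 0.3.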