pyspark_ds_toolbox.ml.feature_importance package
Submodules
pyspark_ds_toolbox.ml.feature_importance.native_spark module
Module with spark native feature importance score tools.
- pyspark_ds_toolbox.ml.feature_importance.native_spark.extract_features_score(model: Union[pyspark.ml.classification.LogisticRegressionModel, pyspark.ml.classification.DecisionTreeClassificationModel, pyspark.ml.classification.RandomForestClassificationModel, pyspark.ml.classification.GBTClassificationModel, pyspark.ml.regression.LinearRegressionModel, pyspark.ml.regression.DecisionTreeRegressionModel, pyspark.ml.regression.RandomForestRegressionModel, pyspark.ml.regression.GBTRegressionModel], dfs: pyspark.sql.dataframe.DataFrame, features_col: str = 'features') → pandas.core.frame.DataFrame
Function that extracts feature importance or coefficients from spark models.
- There are three possible situations handled inside this function:
1) If the model is a DecisionTree, RandomForest or GBT (classification or regression), the feature scores will be Gini-based importances. See https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.classification.DecisionTreeClassificationModel.html?highlight=featureimportances#pyspark.ml.classification.DecisionTreeClassificationModel.featureImportances for a description of the feature importances;
2) If the model is a LogisticRegression, the feature scores will be the odds ratios (see the sketch after this list);
3) If the model is a LinearRegression, the feature scores will be the coefficients.
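For the logistic-regression case, the odds ratio of a feature is exp(coefficient). A minimal illustrative sketch (not part of the library; the helper name is hypothetical and assumes a binary LogisticRegressionModel):

    import numpy as np
    from pyspark.ml.classification import LogisticRegressionModel

    def coefficients_to_odds_ratios(lr_model: LogisticRegressionModel) -> np.ndarray:
        # exp() of each log-odds coefficient gives the per-feature odds ratio
        # (binary logistic regression exposes a single coefficient vector).
        return np.exp(lr_model.coefficients.toArray())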
- Parameters
model (Union[spark models]) – A fitted spark model. Accepted models: DecisionTree, RandomForest, GBT (both classification and regression), LogisticRegression and LinearRegression.
dfs (pyspark.sql.dataframe.DataFrame) – The SparkDataFrame used to fit the model.
features_col (str, optional) – The features vector column. Defaults to ‘features’.
- Raises
ValueError – if features_col is not in dfs.columns.
- Returns
- A PandasDataFrame with the following columns:
feat_index: The feature index number in the features vector column;
feature: The feature name;
delta_gini/odds_ratio/coefficient: The feature score depending on the model (see the description).
- Return type
[pd.core.frame.DataFrame]
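A hedged usage sketch (not from the package docs; the toy data and column names are illustrative) showing how extract_features_score is typically called on a fitted Spark model:

    from pyspark.sql import SparkSession
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.classification import RandomForestClassifier
    from pyspark_ds_toolbox.ml.feature_importance.native_spark import extract_features_score

    spark = SparkSession.builder.getOrCreate()

    # Toy data: two numeric features and a binary label (illustrative only).
    sdf = spark.createDataFrame(
        [(1.0, 0.5, 0), (2.0, 1.5, 1), (3.0, 0.1, 0), (4.0, 2.2, 1)],
        ["x1", "x2", "label"],
    )
    dfs = VectorAssembler(inputCols=["x1", "x2"], outputCol="features").transform(sdf)

    model = RandomForestClassifier(featuresCol="features", labelCol="label").fit(dfs)

    # Returns a pandas DataFrame with feat_index, feature and the score column.
    scores_pdf = extract_features_score(model=model, dfs=dfs, features_col="features")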
pyspark_ds_toolbox.ml.feature_importance.shap_values module
This module implements a way to estimate SHAP values from SparkDFs.
We were not able to implement a generic function/class for a pure Spark model that takes numerical and categorical features into account, so this implementation is a workaround to get the job done. The main function is estimate_shap_values(); check its documentation directly below.
- class pyspark_ds_toolbox.ml.feature_importance.shap_values.ShapValuesPDF(df: pandas.core.frame.DataFrame, id_col: str, target_col: str, cat_features: Union[None, List[str]], sort_metric: str, problem_type: str, max_mem_size: str = '3G', max_models: int = 10, max_runtime_secs: int = 60, nfolds: int = 5, seed: int = 90)
Bases:
object
H2O AutoML Wrapper
- as_h2o_df()
- extract_shap_values()
- fit_automl()
- get_feature_cols()
- start_h2o()
- pyspark_ds_toolbox.ml.feature_importance.shap_values.estimate_shap_values(sdf: pyspark.sql.dataframe.DataFrame, id_col: str, target_col: str, cat_features: Optional[List[str]], sort_metric: str, problem_type: str, subset_size: int = 2000, max_mem_size: str = '2G', max_models: int = 8, max_runtime_secs: int = 30, nfolds: int = 5, seed: int = 90)
Computes for each row the shap values of each feature.
This function splits the sdf into int(sdf.count()/subset_size) pandas DataFrames and then applies the class ShapValuesPDF, which is a wrapper around H2O AutoML, to each subset using the applyInPandas() method of a grouped SparkDF (a rough sketch of this pattern follows below).
Check the following link for an intuition of how this works: https://www.youtube.com/watch?v=x6dSsbXhyPo
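A rough sketch of the subsetting pattern described above, under the assumption that rows are assigned to groups and processed with applyInPandas; per_group_shap_fn and output_schema are hypothetical placeholders for the work delegated to ShapValuesPDF:

    import pyspark.sql.functions as F

    n_subsets = int(sdf.count() / subset_size)

    # Assign each row to one of n_subsets groups (assumed mechanism, simplified).
    sdf_grouped = sdf.withColumn(
        "subset_id", F.monotonically_increasing_id() % n_subsets
    )

    # Each group is handed to a pandas function that fits H2O AutoML and
    # extracts SHAP values; per_group_shap_fn and output_schema are hypothetical.
    # shap_sdf = sdf_grouped.groupBy("subset_id").applyInPandas(
    #     per_group_shap_fn, schema=output_schema
    # )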
- Parameters
sdf (pyspark.sql.dataframe.DataFrame) – A SparkDF. Note that all columns, except the id_col, must be features.
id_col (str) – Column name of the identifier.
target_col (str) – Column name of the target value.
cat_features (Union[List[str], None]) – List of column names of the categorical variables, if any.
sort_metric (str) – A metric to sort the candidates (see https://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/algo-params/sort_metric.html).
problem_type (str) – ‘regression’ or ‘classification’. If ‘classification’, then target_col must contain binary values.
subset_size (int, optional) – Number of rows for each sub dataset. Defaults to 2000.
max_mem_size (str, optional) – Max memory size to be allocated to h2o local cluster. Defaults to ‘2G’. (see https://docs.h2o.ai/h2o/latest-stable/h2o-py/docs/h2o.html#h2o.init)
max_models (int, optional) – Max number of model to be fitted. These models are ranked according to sort_metric. Defaults to 8. (see https://docs.h2o.ai/h2o/latest-stable/h2o-py/docs/modeling.html?highlight=automl#h2oautoml)
max_runtime_secs (int, optional) – Max number of seconds to spend fitting the models. Defaults to 30. (see https://docs.h2o.ai/h2o/latest-stable/h2o-py/docs/modeling.html?highlight=automl#h2oautoml)
nfolds (int, optional) – Number of folds to be used for cross validation while fitting. Defaults to 5. (see https://docs.h2o.ai/h2o/latest-stable/h2o-py/docs/modeling.html?highlight=automl#h2oautoml)
seed (int, optional) – Seed. Defaults to 90.
- Raises
ValueError – if int(sdf.count()/subset_size) < 2.
- Returns
- A SparkDF with the following columns:
id_col: The values from the column passed as id_col;
feature: The name of each feature column;
shap_value: The shap value of the feature.
- Return type
[pyspark.sql.dataframe.DataFrame]
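A hedged usage sketch (the column names and metric are illustrative; every column other than id_col is treated as a feature, as noted above):

    from pyspark_ds_toolbox.ml.feature_importance.shap_values import estimate_shap_values

    shap_sdf = estimate_shap_values(
        sdf=sdf,                        # SparkDF with id_col, target_col and feature columns
        id_col="id",
        target_col="y",
        cat_features=["cat_1"],         # or None if there are no categorical features
        sort_metric="logloss",
        problem_type="classification",
        subset_size=2000,
        max_mem_size="2G",
        max_models=8,
        max_runtime_secs=30,
        nfolds=5,
        seed=90,
    )

    # One row per (id_col value, feature) pair with its shap_value.
    shap_sdf.show()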