pyspark_ds_toolbox.ml.feature_importance package

Submodules

pyspark_ds_toolbox.ml.feature_importance.native_spark module

Module with spark native feature importance score tools.

pyspark_ds_toolbox.ml.feature_importance.native_spark.extract_features_score(model: Union[pyspark.ml.classification.LogisticRegressionModel, pyspark.ml.classification.DecisionTreeClassificationModel, pyspark.ml.classification.RandomForestClassificationModel, pyspark.ml.classification.GBTClassificationModel, pyspark.ml.regression.LinearRegressionModel, pyspark.ml.regression.DecisionTreeRegressionModel, pyspark.ml.regression.RandomForestRegressionModel, pyspark.ml.regression.GBTRegressionModel], dfs: pyspark.sql.dataframe.DataFrame, features_col: str = 'features') pandas.core.frame.DataFrame

Function that extracts feature importance or coefficients from spark models.

There are three possible situations handled inside this function:

1. If the model is a DecisionTree, RandomForest or GBT (classification or regression), the feature scores will be Gini-based feature importances. See https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.classification.DecisionTreeClassificationModel.html?highlight=featureimportances#pyspark.ml.classification.DecisionTreeClassificationModel.featureImportances for a description of the feature importance;

2. If the model is a LogisticRegression, the feature scores will be the odds ratios;

3. If the model is a LinearRegression, the feature scores will be the coefficients.

Parameters
  • model (Union[spark models]) – A fitted spark model. Accepted models: DecisionTree, RandomForest, GBT (both classification and regression), LogisticRegression and LinearRegression.

  • dfs (pyspark.sql.dataframe.DataFrame) – The SparkDataFrame used to fit the model.

  • features_col (str, optional) – The features vector column. Defaults to ‘features’.

Raises

ValueError – if features_col not in dfs.columns is True.

Returns

A PandasDataFrame with the following columns:
  • feat_index: The feature index number in the features vector column;

  • feature: The feature name;

  • delta_gini/odds_ratio/coefficient: The feature score depending on the model (see the description).

Return type

pandas.core.frame.DataFrame
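
A minimal usage sketch is given below; the toy data, column names and model choice are illustrative assumptions, not part of the library.

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import RandomForestClassifier

from pyspark_ds_toolbox.ml.feature_importance.native_spark import extract_features_score

spark = SparkSession.builder.getOrCreate()

# Toy SparkDF with two numerical features and a binary label.
sdf = spark.createDataFrame(
    [(1.0, 2.0, 0), (0.5, 1.0, 1), (2.0, 0.1, 0), (1.5, 3.0, 1)],
    schema=['x1', 'x2', 'label'],
)

# Assemble the raw columns into the vector column the spark models expect.
assembler = VectorAssembler(inputCols=['x1', 'x2'], outputCol='features')
dfs = assembler.transform(sdf)

model = RandomForestClassifier(featuresCol='features', labelCol='label').fit(dfs)

# Returns a pandas DataFrame with feat_index, feature and the Gini-based score.
pdf_scores = extract_features_score(model=model, dfs=dfs, features_col='features')
print(pdf_scores)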

pyspark_ds_toolbox.ml.feature_importance.shap_values module

This module implements a way to estimate shap values from SparkDFs.

We were not able to implement a generic function/class for pure Spark models that takes numerical and categorical features into account, so this implementation is a workaround to get the job done. The main function is estimate_shap_values(); you can check the documentation on that function directly.

class pyspark_ds_toolbox.ml.feature_importance.shap_values.ShapValuesPDF(df: pandas.core.frame.DataFrame, id_col: str, target_col: str, cat_features: Union[None, List[str]], sort_metric: str, problem_type: str, max_mem_size: str = '3G', max_models: int = 10, max_runtime_secs: int = 60, nfolds: int = 5, seed: int = 90)

Bases: object

H2O AutoML Wrapper

as_h2o_df()
extract_shap_values()
fit_automl()
get_feature_cols()
start_h2o()
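
A minimal construction sketch of the wrapper is shown below; the toy data, the sort_metric value and the problem_type value are assumptions, and the workflow comment is inferred from the method names rather than documented behaviour.

import pandas as pd

from pyspark_ds_toolbox.ml.feature_importance.shap_values import ShapValuesPDF

# Toy pandas DataFrame; all column names here are illustrative.
pdf = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'color': ['red', 'blue', 'red', 'blue'],
    'x1': [0.1, 0.9, 0.4, 0.7],
    'target': [0, 1, 0, 1],
})

svp = ShapValuesPDF(
    df=pdf,
    id_col='id',
    target_col='target',
    cat_features=['color'],
    sort_metric='logloss',          # assumed H2O leaderboard metric name
    problem_type='classification',  # assumed accepted value
)
# The method names suggest a start_h2o() -> fit_automl() ->
# extract_shap_values() sequence; estimate_shap_values() below is the
# intended public entry point and drives this class for you.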
pyspark_ds_toolbox.ml.feature_importance.shap_values.estimate_shap_values(sdf: pyspark.sql.dataframe.DataFrame, id_col: str, target_col: str, cat_features: Optional[List[str]], sort_metric: str, problem_type: str, subset_size: int = 2000, max_mem_size: str = '2G', max_models: int = 8, max_runtime_secs: int = 30, nfolds: int = 5, seed: int = 90)

Computes for each row the shap values of each feature.

This function splits sdf into int(sdf.count()/subset_size) pandas DataFrames and then applies the ShapValuesPDF class, a wrapper around H2O AutoML, to each subset using the applyInPandas() method of a grouped SparkDF.

Check the following link for an intuition of how this works: https://www.youtube.com/watch?v=x6dSsbXhyPo
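
The grouping key used internally is not documented here; the sketch below, with assumed column names and a placeholder in place of the H2O AutoML fit, only illustrates the split-and-applyInPandas pattern described above.

import pandas as pd
import pyspark.sql.functions as F
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Toy SparkDF standing in for sdf; column names are illustrative.
sdf = spark.createDataFrame(
    [(i, float(i) / 10, i % 2) for i in range(8)],
    schema=['id', 'x1', 'target'],
)
subset_size = 2

def shap_per_subset(pdf: pd.DataFrame) -> pd.DataFrame:
    # Placeholder body: the library instead fits H2O AutoML (ShapValuesPDF)
    # on this pandas subset and returns its per-row shap values.
    feats = [c for c in pdf.columns if c not in ('id', 'target', 'subset_id')]
    return pd.DataFrame({
        'id': pdf['id'].repeat(len(feats)).values,
        'feature': feats * len(pdf),
        'shap_value': 0.0,
    })

n_subsets = int(sdf.count() / subset_size)  # a ValueError is raised if < 2
sdf_grouped = sdf.withColumn('subset_id', (F.rand(seed=90) * n_subsets).cast('int'))
result = sdf_grouped.groupBy('subset_id').applyInPandas(
    shap_per_subset, schema='id long, feature string, shap_value double'
)
result.show()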

Parameters
  • sdf (pyspark.sql.dataframe.DataFrame) – A SparkDF with the id, target and feature columns;

  • id_col (str) – Name of the column used to identify each row;

  • target_col (str) – Name of the target column;

  • cat_features (List[str], optional) – List of the categorical feature columns, if any;

  • sort_metric (str) – Metric used by H2O AutoML to sort the model leaderboard;

  • problem_type (str) – Whether the problem is a classification or a regression one;

  • subset_size (int, optional) – Approximate number of rows in each pandas subset. Defaults to 2000;

  • max_mem_size (str, optional) – Maximum memory allocated to the H2O cluster. Defaults to '2G';

  • max_models (int, optional) – Maximum number of models built by AutoML. Defaults to 8;

  • max_runtime_secs (int, optional) – Maximum AutoML runtime in seconds. Defaults to 30;

  • nfolds (int, optional) – Number of cross-validation folds. Defaults to 5;

  • seed (int, optional) – Random seed. Defaults to 90.

Raises

ValueError – if int(sdf.count()/subset_size) < 2 is True

Returns

A SparkDF with the following columns:
  • id_col: The values from the column passed as id_col;

  • feature: The name of each feature;

  • shap_value: The shap value of the feature.

Return type

pyspark.sql.dataframe.DataFrame
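
A minimal end-to-end usage sketch is given below; the toy data is illustrative, the sort_metric and problem_type values are assumptions (check the accepted options in the source), and a working h2o installation is required.

from pyspark.sql import SparkSession

from pyspark_ds_toolbox.ml.feature_importance.shap_values import estimate_shap_values

spark = SparkSession.builder.getOrCreate()

# Toy SparkDF with an id, a numerical feature, a categorical feature and a target.
sdf = spark.createDataFrame(
    [(i, float(i) / 100, 'a' if i % 2 else 'b', i % 2) for i in range(4000)],
    schema=['id', 'x1', 'color', 'target'],
)

sdf_shap = estimate_shap_values(
    sdf=sdf,
    id_col='id',
    target_col='target',
    cat_features=['color'],
    sort_metric='logloss',          # assumed metric name
    problem_type='classification',  # assumed accepted value
    subset_size=2000,               # 4000 rows -> 2 subsets, satisfying the >= 2 check
)
# One row per (id_col value, feature) pair with its estimated shap value.
sdf_shap.show(5)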

Module contents

Subpackage dedicated to Feature Importance tools.