pyspark_ds_toolbox.ml.feature_importance package
Submodules
pyspark_ds_toolbox.ml.feature_importance.native_spark module
Module with spark native feature importance score tools.
- pyspark_ds_toolbox.ml.feature_importance.native_spark.extract_features_score(model: Union[pyspark.ml.classification.LogisticRegressionModel, pyspark.ml.classification.DecisionTreeClassificationModel, pyspark.ml.classification.RandomForestClassificationModel, pyspark.ml.classification.GBTClassificationModel, pyspark.ml.regression.LinearRegressionModel, pyspark.ml.regression.DecisionTreeRegressionModel, pyspark.ml.regression.RandomForestRegressionModel, pyspark.ml.regression.GBTRegressionModel], dfs: pyspark.sql.dataframe.DataFrame, features_col: str = 'features') pandas.core.frame.DataFrame
- Function that extracts feature importances or coefficients from Spark models. There are three possible situations handled inside this function (a usage sketch follows this entry):
- 1) If the model is a DecisionTree, RandomForest or GBT (classification or regression), the feature scores will be gini importances. See https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.classification.DecisionTreeClassificationModel.html?highlight=featureimportances#pyspark.ml.classification.DecisionTreeClassificationModel.featureImportances for a description of the feature importance;
- 2) If the model is a LogisticRegression, the feature scores will be the odds ratios;
- 3) If the model is a LinearRegression, the feature scores will be the coefficients. 
 
 - Parameters
- model (Union[spark models]) – A fitted spark model. Accepted models: DecisionTree, RandomForest, GBT (both classification and regression), LogisticRegression and LinearRegression. 
- dfs (pyspark.sql.dataframe.DataFrame) – The SparkDataFrame used to fit the model. 
- features_col (str, optional) – The features vector column. Defaults to ‘features’. 
 
- Raises
- ValueError – if features_col is not in dfs.columns. 
- Returns
- A pandas DataFrame with the following columns:
- feat_index: The feature index number in the features vector column; 
- feature: The feature name; 
- delta_gini/odds_ratio/coefficient: The feature score depending on the model (see the description). 
 
 
- Return type
- [pd.core.frame.DataFrame] 
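A minimal usage sketch, assuming a toy DataFrame and a RandomForestClassifier (the data, column names and model choice are illustrative; only extract_features_score and its signature come from this module):

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import RandomForestClassifier

from pyspark_ds_toolbox.ml.feature_importance.native_spark import extract_features_score

spark = SparkSession.builder.getOrCreate()

# Illustrative data: two numeric features and a binary label.
sdf = spark.createDataFrame(
    [(1.0, 10.0, 0), (2.0, 20.0, 0), (3.0, 5.0, 1), (4.0, 2.0, 1)],
    ["x1", "x2", "label"],
)

# Assemble the features into the vector column expected by the model.
assembler = VectorAssembler(inputCols=["x1", "x2"], outputCol="features")
dfs = assembler.transform(sdf)

# A tree ensemble yields gini-based scores; LogisticRegression would yield
# odds ratios and LinearRegression the raw coefficients (see above).
model = RandomForestClassifier(featuresCol="features", labelCol="label").fit(dfs)

# Pandas DataFrame with feat_index, feature and the model-dependent score column.
scores_pdf = extract_features_score(model=model, dfs=dfs, features_col="features")
print(scores_pdf)
```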
 
pyspark_ds_toolbox.ml.feature_importance.shap_values module
This module implements a way to estimate shap values from SparkDFs.
We were not able to implement a generic function/class for a pure Spark model that takes both numerical and categorical features into account, so this implementation is a workaround to get the job done. The main function is estimate_shap_values(); check its documentation directly. The pattern behind the workaround is sketched below.
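The splitting trick can be outlined with plain PySpark. This is an illustrative sketch of the groupBy().applyInPandas() pattern only, not the module's internals; the group assignment, the output schema and the dummy per-group function are assumptions:

```python
import pandas as pd
import pyspark.sql.functions as F
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Toy SparkDF with an id column and one feature.
sdf = spark.range(10_000).withColumn("x", F.rand(seed=90))

subset_size = 2000
n_groups = int(sdf.count() / subset_size)

# Bucket the rows so that each bucket holds roughly subset_size rows; every
# bucket is handed to the per-group function as one pandas DataFrame.
sdf = sdf.withColumn("group", F.monotonically_increasing_id() % n_groups)

def score_group(pdf: pd.DataFrame) -> pd.DataFrame:
    # Placeholder for the per-subset work (in the toolbox: fit H2O AutoML on
    # the subset and extract its shap values); here it emits a dummy score.
    return pd.DataFrame({"id": pdf["id"], "feature": "x", "shap_value": 0.0})

result = sdf.groupBy("group").applyInPandas(
    score_group, schema="id long, feature string, shap_value double"
)
result.show(5)
```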
- class pyspark_ds_toolbox.ml.feature_importance.shap_values.ShapValuesPDF(df: pandas.core.frame.DataFrame, id_col: str, target_col: str, cat_features: Union[None, List[str]], sort_metric: str, problem_type: str, max_mem_size: str = '3G', max_models: int = 10, max_runtime_secs: int = 60, nfolds: int = 5, seed: int = 90)
- Bases: object
- H2O AutoML Wrapper
- as_h2o_df()
 - extract_shap_values()
 - fit_automl()
 - get_feature_cols()
 - start_h2o()
 
- pyspark_ds_toolbox.ml.feature_importance.shap_values.estimate_shap_values(sdf: pyspark.sql.dataframe.DataFrame, id_col: str, target_col: str, cat_features: Optional[List[str]], sort_metric: str, problem_type: str, subset_size: int = 2000, max_mem_size: str = '2G', max_models: int = 8, max_runtime_secs: int = 30, nfolds: int = 5, seed: int = 90)
- Computes, for each row, the shap values of each feature. This function splits the sdf into int(sdf.count()/subset_size) pandas DataFrames and then applies the class ShapValuesPDF, which is a wrapper around H2O AutoML, to each subset using the applyInPandas() method of a grouped SparkDF. Check the following link for an intuition of how this works: https://www.youtube.com/watch?v=x6dSsbXhyPo (a usage sketch follows this entry).
- Parameters
- sdf (pyspark.sql.dataframe.DataFrame) – A SparkDF. Note that all columns, except the id_col, must be features. 
- id_col (str) – Column name of the identifier. 
- target_col (str) – Column name of the target value. 
- cat_features (Union[List[str], None]) – List of column names of the categorical variables, if any. 
- sort_metric (str) – A metric to sort the candidates (see https://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/algo-params/sort_metric.html). 
- problem_type (str) – ‘regression’ or ‘classification’. If ‘classification’, then target_col must contain binary values. 
- subset_size (int, optional) – Number of rows for each sub dataset. Defaults to 2000. 
- max_mem_size (str, optional) – Max memory size to be allocated to h2o local cluster. Defaults to ‘2G’. (see https://docs.h2o.ai/h2o/latest-stable/h2o-py/docs/h2o.html#h2o.init) 
- max_models (int, optional) – Max number of models to be fitted. These models are ranked according to sort_metric. Defaults to 8. (see https://docs.h2o.ai/h2o/latest-stable/h2o-py/docs/modeling.html?highlight=automl#h2oautoml) 
- max_runtime_secs (int, optional) – Max number of seconds to spend fitting the models. Defaults to 30. (see https://docs.h2o.ai/h2o/latest-stable/h2o-py/docs/modeling.html?highlight=automl#h2oautoml) 
- nfolds (int, optional) – Number of folds to be used for cross validation while fitting. Defaults to 5. (see https://docs.h2o.ai/h2o/latest-stable/h2o-py/docs/modeling.html?highlight=automl#h2oautoml) 
- seed (int, optional) – Seed. Defaults to 90. 
 
- Raises
- ValueError – if int(sdf.count()/subset_size) < 2. 
- Returns
- A SparkDF with the following columns:
- id_col: The values from the column passed as id_col; 
- feature: The name of each feature column; 
- shap_value: The shap value of the feature. 
 
 
- Return type
- [pyspark.sql.dataframe.DataFrame]
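A minimal usage sketch, assuming a toy binary-classification SparkDF (the data, column names and the ‘auc’ sort metric are illustrative; the call follows the signature documented above):

```python
from pyspark.sql import SparkSession

from pyspark_ds_toolbox.ml.feature_importance.shap_values import estimate_shap_values

spark = SparkSession.builder.getOrCreate()

# Illustrative data: an id column, one numeric and one categorical feature,
# plus a binary target. All columns other than the id_col are used by the model.
sdf = spark.createDataFrame(
    [(i, float(i % 7), "a" if i % 2 == 0 else "b", i % 2) for i in range(5000)],
    ["id", "num_feat", "cat_feat", "target"],
)

shap_sdf = estimate_shap_values(
    sdf=sdf,
    id_col="id",
    target_col="target",
    cat_features=["cat_feat"],
    sort_metric="auc",
    problem_type="classification",
    subset_size=2000,  # int(5000 / 2000) = 2 subsets, the minimum allowed
)

# One row per (id_col value, feature) pair with its estimated shap value.
shap_sdf.show(5)
```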