pyspark_ds_toolbox.ml.feature_importance package
Submodules
pyspark_ds_toolbox.ml.feature_importance.native_spark module
Module with spark native feature importance score tools.
- pyspark_ds_toolbox.ml.feature_importance.native_spark.extract_features_score(model: Union[pyspark.ml.classification.LogisticRegressionModel, pyspark.ml.classification.DecisionTreeClassificationModel, pyspark.ml.classification.RandomForestClassificationModel, pyspark.ml.classification.GBTClassificationModel, pyspark.ml.regression.LinearRegressionModel, pyspark.ml.regression.DecisionTreeRegressionModel, pyspark.ml.regression.RandomForestRegressionModel, pyspark.ml.regression.GBTRegressionModel], dfs: pyspark.sql.dataframe.DataFrame, features_col: str = 'features') → pandas.core.frame.DataFrame
Function that extracts feature importance or coefficients from spark models.
- There are three possible situations handled inside this function:
1) If the model is a DecisionTree, RandomForest or GBT (classification or regression), the feature scores will be Gini-based importances. See https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.classification.DecisionTreeClassificationModel.html?highlight=featureimportances#pyspark.ml.classification.DecisionTreeClassificationModel.featureImportances for a description of the feature importances;
2) If the model is a LogisticRegression, the feature scores will be the odds ratios (see the sketch after this list);
3) If the model is a LinearRegression, the feature scores will be the coefficients.
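For the logistic-regression case, the odds ratio of a feature is exp(coefficient). A minimal illustrative sketch (not part of the library; the helper name is hypothetical and assumes a binary LogisticRegressionModel):

    import numpy as np
    from pyspark.ml.classification import LogisticRegressionModel

    def coefficients_to_odds_ratios(lr_model: LogisticRegressionModel) -> np.ndarray:
        # exp() of each log-odds coefficient gives the per-feature odds ratio
        # (binary logistic regression exposes a single coefficient vector).
        return np.exp(lr_model.coefficients.toArray())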
- Parameters
model (Union[spark models]) – A fitted spark model. Accepted models: DecisionTree, RandomForest, GBT (both classification and regression), LogisticRegression and LinearRegression.
dfs (pyspark.sql.dataframe.DataFrame) – The SparkDataFrame used to fit the model.
features_col (str, optional) – The features vector column. Defaults to ‘features’.
- Raises
ValueError – if features_col is not in dfs.columns.
- Returns
- A PandasDataFrame with the following columns:
feat_index: The feature index number in the features vector column;
feature: The feature name;
delta_gini/odds_ratio/coefficient: The feature score depending on the model (see the description).
- Return type
[pd.core.frame.DataFrame]
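A hedged usage sketch (not from the package docs; the toy data and column names are illustrative) showing how extract_features_score is typically called on a fitted Spark model:

    from pyspark.sql import SparkSession
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.classification import RandomForestClassifier
    from pyspark_ds_toolbox.ml.feature_importance.native_spark import extract_features_score

    spark = SparkSession.builder.getOrCreate()

    # Toy data: two numeric features and a binary label (illustrative only).
    sdf = spark.createDataFrame(
        [(1.0, 0.5, 0), (2.0, 1.5, 1), (3.0, 0.1, 0), (4.0, 2.2, 1)],
        ["x1", "x2", "label"],
    )
    dfs = VectorAssembler(inputCols=["x1", "x2"], outputCol="features").transform(sdf)

    model = RandomForestClassifier(featuresCol="features", labelCol="label").fit(dfs)

    # Returns a pandas DataFrame with feat_index, feature and the score column.
    scores_pdf = extract_features_score(model=model, dfs=dfs, features_col="features")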
pyspark_ds_toolbox.ml.feature_importance.shap_values module
This module implements a way to estimate SHAP values from SparkDFs.
We were not able to implement a generic function/class for a pure Spark model that takes numerical and categorical features into account, so this implementation is a workaround to get the job done. The main function is estimate_shap_values(); check its documentation directly below.
- class pyspark_ds_toolbox.ml.feature_importance.shap_values.ShapValuesPDF(df: pandas.core.frame.DataFrame, id_col: str, target_col: str, cat_features: Union[None, List[str]], sort_metric: str, problem_type: str, max_mem_size: str = '3G', max_models: int = 10, max_runtime_secs: int = 60, nfolds: int = 5, seed: int = 90)
Bases:
object
H2O AutoML Wrapper
- as_h2o_df()
- extract_shap_values()
- fit_automl()
- get_feature_cols()
- start_h2o()
- pyspark_ds_toolbox.ml.feature_importance.shap_values.estimate_shap_values(sdf: pyspark.sql.dataframe.DataFrame, id_col: str, target_col: str, cat_features: Optional[List[str]], sort_metric: str, problem_type: str, subset_size: int = 2000, max_mem_size: str = '2G', max_models: int = 8, max_runtime_secs: int = 30, nfolds: int = 5, seed: int = 90)
Computes for each row the shap values of each feature.
This function splits the sdf into int(sdf.count()/subset_size) pandas DataFrames and then applies the class ShapValuesPDF, which is a wrapper around H2O AutoML, to each subset using the applyInPandas() method of a grouped SparkDF (a rough sketch of this pattern follows below).
Check the following link for an intuition of how this works: https://www.youtube.com/watch?v=x6dSsbXhyPo
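A rough sketch of the subsetting pattern described above, under the assumption that rows are assigned to groups and processed with applyInPandas; per_group_shap_fn and output_schema are hypothetical placeholders for the work delegated to ShapValuesPDF:

    import pyspark.sql.functions as F

    n_subsets = int(sdf.count() / subset_size)

    # Assign each row to one of n_subsets groups (assumed mechanism, simplified).
    sdf_grouped = sdf.withColumn(
        "subset_id", F.monotonically_increasing_id() % n_subsets
    )

    # Each group is handed to a pandas function that fits H2O AutoML and
    # extracts SHAP values; per_group_shap_fn and output_schema are hypothetical.
    # shap_sdf = sdf_grouped.groupBy("subset_id").applyInPandas(
    #     per_group_shap_fn, schema=output_schema
    # )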
- Parameters
sdf (pyspark.sql.dataframe.DataFrame) – A SparkDF. Note that all columns, except the id_col, must be features.
id_col (str) – Column name of the identifier.
target_col (str) – Column name of the target value.
cat_features (Union[List[str], None]) – List of column names of the categorical variables, if any.
sort_metric (str) – A metric to sort the candidates (see https://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/algo-params/sort_metric.html).
problem_type (str) – ‘regression’ or ‘classification’. If ‘classification’, then target_col must contain binary values.
subset_size (int, optional) – Number of rows for each sub dataset. Defaults to 2000.
max_mem_size (str, optional) – Max memory size to be allocated to h2o local cluster. Defaults to ‘2G’. (see https://docs.h2o.ai/h2o/latest-stable/h2o-py/docs/h2o.html#h2o.init)
max_models (int, optional) – Max number of model to be fitted. These models are ranked according to sort_metric. Defaults to 8. (see https://docs.h2o.ai/h2o/latest-stable/h2o-py/docs/modeling.html?highlight=automl#h2oautoml)
max_runtime_secs (int, optional) – Max number of seconds to spend fitting the models. Defaults to 30. (see https://docs.h2o.ai/h2o/latest-stable/h2o-py/docs/modeling.html?highlight=automl#h2oautoml)
nfolds (int, optional) – Number of folds to be used for cross validation while fitting. Defaults to 5. (see https://docs.h2o.ai/h2o/latest-stable/h2o-py/docs/modeling.html?highlight=automl#h2oautoml)
seed (int, optional) – Seed. Defaults to 90.
- Raises
ValueError – if int(sdf.count()/subset_size) < 2.
- Returns
- A SparkDF with the following columns:
id_col: The values from the column passed as id_col;
feature: The name of each feature column;
shap_value: The shap value of the feature.
- Return type
[pyspark.sql.dataframe.DataFrame]
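A hedged usage sketch (the column names and metric are illustrative; every column other than id_col is treated as a feature, as noted above):

    from pyspark_ds_toolbox.ml.feature_importance.shap_values import estimate_shap_values

    shap_sdf = estimate_shap_values(
        sdf=sdf,                        # SparkDF with id_col, target_col and feature columns
        id_col="id",
        target_col="y",
        cat_features=["cat_1"],         # or None if there are no categorical features
        sort_metric="logloss",
        problem_type="classification",
        subset_size=2000,
        max_mem_size="2G",
        max_models=8,
        max_runtime_secs=30,
        nfolds=5,
        seed=90,
    )

    # One row per (id_col value, feature) pair with its shap_value.
    shap_sdf.show()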