pyspark_ds_toolbox.ml.classification package

Submodules

pyspark_ds_toolbox.ml.classification.baseline_classifiers module

A module for baseline classifiers.

pyspark_ds_toolbox.ml.classification.baseline_classifiers.baseline_binary_classfiers(dfs: pyspark.sql.dataframe.DataFrame, dfs_test: pyspark.sql.dataframe.DataFrame, id_col: str, target_col: str, num_features: Optional[List[str]] = None, cat_features: Optional[List[str]] = None, weight_on_target: bool = False, log_mlflow_run: bool = False, mlflow_experiment_name: Union[None, str] = None, artifact_stage_path: Union[None, str] = None) dict

Fits a set of models that can be used as baseline classifiers.

This function will:
  1. Add a features vector to dfs and dfs_test (see pyspark_ds_toolbox.ml.classification.data_prep.get_features_vector());

  2. Fit the following models, without any tuning: LogisticRegression, DecisionTreeClassifier, RandomForestClassifier and GBTClassifier;

  3. Extract the feature scores from the trained models (see pyspark_ds_toolbox.ml.feature_importance.native_spark.extract_features_score);

  4. Use the fitted models to predict on dfs_test;

  5. Compute evaluation metrics on the test data (see the pyspark_ds_toolbox.ml.classification.eval module).

Parameters
• dfs (pyspark.sql.dataframe.DataFrame) – The training Spark DataFrame.

  • dfs_test (pyspark.sql.dataframe.DataFrame) – The test Spark DataFrame. Reported metrics are computed from the predictions on this data.

  • id_col (str) – Column name of the id. Used to compute the confusion matrix.

  • target_col (str) – Column name of the target to be predicted. Must contain only the values 1 and 0.

  • num_features (Union[List[str], None], optional) – List of the column names of the numerical features. Defaults to None.

  • cat_features (Union[List[str], None], optional) – List of the column names of the categorical features. Defaults to None.

  • weight_on_target (bool, optional) – If True, adds a class weight based on target_col (see pyspark_ds_toolbox.ml.data_prep.binary_classifier_weights). Defaults to False.

  • log_mlflow_run (bool, optional) – If True, logs the params, metrics, confusion matrix, decile table and model in an MLflow run for each fit. Defaults to False.

  • mlflow_experiment_name (Union[None, str], optional) – Name of the experiment where the runs should be logged. Defaults to None.

  • artifact_stage_path (Union[None, str], optional) – Path where the confusion matrix and decile table are written before being logged to MLflow. Defaults to None.

Raises
• ValueError – If dfs.schema != dfs_test.schema.

  • ValueError – If id_col or target_col is not a column of dfs (i.e., len(set([id_col, target_col]).difference(set(dfs.columns))) != 0).

Returns

A dict with one entry per algorithm (keys: LogisticRegression, DecisionTreeClassifier, RandomForestClassifier and GBTClassifier). Each entry is a dictionary with the following keys:
  • model: The trained Spark model;

  • feature_score: The feature importances of the model (see pyspark_ds_toolbox.ml.feature_importance.native_spark.extract_features_score);

  • metrics: A dict with the confusion_matrix and the f1, auc, accuracy, precision, recall and max_ks metrics;

  • decile_table: A table with a decile analysis of the predicted probabilities.

Return type

dict
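
A minimal usage sketch follows. The SparkSession, toy data and column names are illustrative placeholders, not part of the package; only the import path and the call signature come from this module:

   from pyspark.sql import SparkSession
   from pyspark_ds_toolbox.ml.classification.baseline_classifiers import baseline_binary_classfiers

   spark = SparkSession.builder.getOrCreate()

   # Toy data: both DataFrames must share the same schema and contain the id
   # column, the 0/1 target column and the feature columns.
   cols = ['id', 'age', 'income', 'state', 'label']
   dfs_train = spark.createDataFrame(
       [(1, 25, 1000.0, 'NY', 0), (2, 40, 3500.0, 'CA', 1),
        (3, 31, 1800.0, 'NY', 0), (4, 55, 5200.0, 'TX', 1)], cols)
   dfs_test = spark.createDataFrame(
       [(5, 29, 1500.0, 'CA', 0), (6, 48, 4100.0, 'TX', 1)], cols)

   baselines = baseline_binary_classfiers(
       dfs=dfs_train, dfs_test=dfs_test,
       id_col='id', target_col='label',
       num_features=['age', 'income'], cat_features=['state']
   )

   # One entry per algorithm; each holds the model, feature scores, metrics
   # and decile table described above.
   print(baselines['LogisticRegression']['metrics'])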

pyspark_ds_toolbox.ml.classification.eval module

Evaluation toolbox.

Module dedicated to functionalities related to classification evaluation.

pyspark_ds_toolbox.ml.classification.eval.binary_classificator_evaluator(dfs_prediction: pyspark.sql.dataframe.DataFrame, col_target: str, col_prediction: str, print_metrics: bool = False) dict

Computes the metrics of a binary classifier from a Spark prediction output table.

Parameters
• dfs_prediction (pyspark.sql.dataframe.DataFrame) – Prediction output table from a Spark binary classifier.

  • col_target (str) – Column name with the target (ground truth).

  • col_prediction (str) – Column name with the predicted label.

  • print_metrics (bool, optional) – Whether or not to print the metrics to the console. Defaults to False.

Raises

Exception – Any error that is encountered.

Returns

A dict with the following metrics:
  • confusion_matrix

  • accuracy

  • precision

  • recall

  • f1 score

• auc

Return type

dict
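
A minimal usage sketch follows. The SparkSession, toy data and column names are illustrative placeholders; only the import path and the call signature come from this module:

   from pyspark.sql import SparkSession
   from pyspark_ds_toolbox.ml.classification.eval import binary_classificator_evaluator

   spark = SparkSession.builder.getOrCreate()

   # Toy prediction output: the ground truth label and the predicted label.
   dfs_pred = spark.createDataFrame(
       [(1, 1.0), (1, 0.0), (0, 0.0), (0, 1.0), (1, 1.0)],
       ['label', 'prediction'])

   # print_metrics=True also echoes the metrics to the console.
   metrics = binary_classificator_evaluator(
       dfs_prediction=dfs_pred,
       col_target='label',
       col_prediction='prediction',
       print_metrics=True
   )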

pyspark_ds_toolbox.ml.classification.eval.binary_classifier_decile_analysis(dfs: pyspark.sql.dataframe.DataFrame, col_id: str, col_target: str, col_probability: str) pyspark.sql.dataframe.DataFrame

Computes a precision, recall and KS decile analysis from the probability predictions of a model. The col_target column MUST contain only the values 0 and 1.

Parameters
• dfs (pyspark.sql.dataframe.DataFrame) – Spark DataFrame with the probability predictions.

  • col_id (str) – Column name with the id values, used to count observations in each decile.

  • col_target (str) – Column name with the ground truth. Must be from a binary classifier with values 1 and 0.

  • col_probability (str) – Column name with the probability estimated by the model.

Raises

ValueError – If the unique values of the col_target column are not 0 and 1.

Returns

A Spark DataFrame with the columns:
  • percentile, min_prob, max_prob, count_id, events, non_events, cum_events, cum_non_events, precision_at_percentile, recall_at_percentile, event_rate, nonevent_rate, cum_eventrate, cum_noneventrate, ks.

Return type

pyspark.sql.dataframe.DataFrame
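
A minimal usage sketch follows. The SparkSession, toy data and column names are illustrative placeholders; only the import path and the call signature come from this module:

   from pyspark.sql import SparkSession
   from pyspark_ds_toolbox.ml.classification.eval import binary_classifier_decile_analysis

   spark = SparkSession.builder.getOrCreate()

   # Toy scored data: an id, the 0/1 ground truth and the predicted probability.
   dfs_scored = spark.createDataFrame(
       [(1, 1, 0.91), (2, 0, 0.35), (3, 1, 0.76), (4, 0, 0.12), (5, 0, 0.48)],
       ['id', 'label', 'p1'])

   deciles = binary_classifier_decile_analysis(
       dfs=dfs_scored,
       col_id='id',
       col_target='label',
       col_probability='p1'
   )
   deciles.show()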

pyspark_ds_toolbox.ml.classification.eval.get_p1(value)

Module contents

Classification toolbox.

Subpackage dedicated to Classification helpers.