pyspark_ds_toolbox.ml.classification package

Submodules

pyspark_ds_toolbox.ml.classification.baseline_classifiers module

A module for baseline classifiers.

pyspark_ds_toolbox.ml.classification.baseline_classifiers.baseline_binary_classfiers(dfs: pyspark.sql.dataframe.DataFrame, dfs_test: pyspark.sql.dataframe.DataFrame, id_col: str, target_col: str, num_features: Optional[List[str]] = None, cat_features: Optional[List[str]] = None, weight_on_target: bool = False, log_mlflow_run: bool = False, mlflow_experiment_name: Union[None, str] = None, artifact_stage_path: Union[None, str] = None) → dict

Fits a set of untuned models that can be used as baselines.

This function will:

Add a features vector to dfs and dfs_test (see pyspark_ds_toolbox.ml.classification.data_prep.get_features_vector());
Fit the following models, without any tuning: LogisticRegression, DecisionTreeClassifier, RandomForestClassifier and GBTClassifier;
Extract the features score from the trained models (see pyspark_ds_toolbox.ml.feature_importance.native_spark.extract_features_score);
Use the fitted models to predict on dfs_test;
Compute evaluation metrics on the test data (see pyspark_ds_toolbox.ml.classification.eval module).

Parameters

dfs (pyspark.sql.dataframe.DataFrame) – The training Spark DataFrame.
dfs_test (pyspark.sql.dataframe.DataFrame) – The test Spark DataFrame. The reported metrics are computed from the predictions on this data.
id_col (str) – Column name of the id. Used to compute the confusion matrix.
target_col (str) – Target to be predicted. Must contain only the values 1 and 0.
num_features (Union[List[str], None], optional) – List of the column names of the numerical features. Defaults to None.
cat_features (Union[List[str], None], optional) – List of the column names of the categorical features. Defaults to None.
weight_on_target (bool, optional) – If True will add a class weight based on target_col (see pyspark_ds_toolbox.ml.data_prep.binary_classifier_weights). Defaults to False.
log_mlflow_run (bool, optional) – If True, logs the params, metrics, confusion matrix, decile table and model in an MLflow run for each fit. Defaults to False.
mlflow_experiment_name (Union[None, str], optional) – Name of the experiment where the runs should be logged. Defaults to None.
artifact_stage_path (Union[None, str], optional) – Path where the confusion matrix and decile table are written before being logged to MLflow. Defaults to None.

Raises

ValueError – If dfs.schema != dfs_test.schema.
ValueError – If id_col or target_col is not a column of dfs, that is, if len(set([id_col, target_col]).difference(set(dfs.columns))) != 0.
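
The two checks above can be sketched in plain Python. This is an illustrative reimplementation, not the library's source: check_inputs is a hypothetical helper, and each schema is represented as a list of (name, type) tuples standing in for a Spark schema object.

```python
from typing import List, Tuple

# Stand-in for a Spark schema: a list of (column name, type name) pairs.
Schema = List[Tuple[str, str]]


def check_inputs(schema: Schema, schema_test: Schema, id_col: str, target_col: str) -> None:
    """Hypothetical helper mirroring the two documented ValueError conditions."""
    # Condition 1: training and test data must share the same schema.
    if schema != schema_test:
        raise ValueError("dfs and dfs_test must have the same schema.")

    # Condition 2: id_col and target_col must both be columns of the training data.
    columns = [name for name, _ in schema]
    missing = set([id_col, target_col]).difference(set(columns))
    if len(missing) != 0:
        raise ValueError(f"Columns not found in dfs: {sorted(missing)}")


schema = [("id", "bigint"), ("age", "double"), ("target", "int")]
check_inputs(schema, schema, id_col="id", target_col="target")  # passes silently
```

Running both checks before any model is fitted means a mismatched test set or a misspelled column name fails immediately instead of partway through training.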

Returns

A dict with one entry per algorithm (keys are LogisticRegression, DecisionTreeClassifier, RandomForestClassifier and GBTClassifier). Each entry is itself a dict with the keys

model: The trained Spark model;
feature_score: The feature importance of the model (see pyspark_ds_toolbox.ml.feature_importance.native_spark.extract_features_score);
metrics: dict with confusion_matrix and f1, auc, accuracy, precision, recall and max_ks;
decile_table: Table with a decile analysis on the predicted probabilities.

Return type

[dict]
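
As a sketch of how the returned structure might be consumed, the snippet below picks the algorithm with the highest value of a given metric. best_baseline is a hypothetical helper, the metric values are placeholders for illustration only, and the key spellings "auc" and "f1" are assumed to match the names listed above.

```python
from typing import Any, Dict


def best_baseline(results: Dict[str, Dict[str, Any]], metric: str = "auc") -> str:
    """Return the algorithm name whose metrics[metric] is highest (hypothetical helper)."""
    return max(results, key=lambda name: results[name]["metrics"][metric])


# Placeholder values for illustration only. A real result would also carry
# the model, feature_score and decile_table entries for each algorithm.
results = {
    "LogisticRegression": {"metrics": {"auc": 0.71, "f1": 0.55}},
    "DecisionTreeClassifier": {"metrics": {"auc": 0.66, "f1": 0.58}},
    "RandomForestClassifier": {"metrics": {"auc": 0.78, "f1": 0.60}},
    "GBTClassifier": {"metrics": {"auc": 0.75, "f1": 0.62}},
}

print(best_baseline(results))        # RandomForestClassifier
print(best_baseline(results, "f1"))  # GBTClassifier
```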

pyspark_ds_toolbox.ml.classification.eval module

Evaluation toolbox.

Module dedicated to functionalities related to classification evaluation.

pyspark_ds_toolbox.ml.classification.eval.binary_classificator_evaluator(dfs_prediction: pyspark.sql.dataframe.DataFrame, col_target: str, col_prediction: str, print_metrics: bool = False) → dict

Computes the metrics of a binary classifier from a Spark prediction output table.

Parameters

dfs_prediction (pyspark.sql.dataframe.DataFrame) – Prediction output table from a Spark binary classifier.
col_target (str) – Column name with the target (ground truth).
col_prediction (str) – Column name with the predictions.
print_metrics (bool, optional) – Whether to print the metrics to the console. Defaults to False.

Raises

Exception – Any error that is encountered.

Returns

Dict with the following metrics

confusion_matrix
accuracy
precision
recall
f1 score
AUC ROC

Return type

[Dict]
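
For reference, the count-based metrics in this list follow from the four cells of the confusion matrix. The sketch below uses the standard textbook definitions in plain Python; binary_metrics is a hypothetical helper, not the library's implementation, and the area under the ROC curve is left out because it needs the ranked scores and not just the four counts.

```python
from typing import Dict


def binary_metrics(tp: int, fp: int, fn: int, tn: int) -> Dict[str, float]:
    """Standard binary classification metrics from confusion matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    # Precision: share of predicted positives that are true positives.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: share of actual positives that were predicted positive.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1: harmonic mean of precision and recall.
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


metrics = binary_metrics(tp=40, fp=10, fn=20, tn=30)
print(metrics)
```

The guards against zero denominators return 0.0 when a metric is undefined, for example when the model predicts no positives at all; this is one common convention, chosen here for simplicity.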

pyspark_ds_toolbox.ml.classification.eval.binary_classifier_decile_analysis(dfs: pyspark.sql.dataframe.DataFrame, col_id: str, col_target: str, col_probability: str) → pyspark.sql.dataframe.DataFrame

Computes a Precision, Recall and KS decile analysis from a probability prediction model. col_target column MUST have values of only 0 and 1.

Parameters

dfs (pyspark.sql.dataframe.DataFrame) – Spark DataFrame with the predicted probabilities.
col_id (str) – Column name with id value, to count the values for each decile.
col_target (str) – Column name with the ground truth. Must be from a binary classifier with values 1 and 0.
col_probability (str) – Column name with the probability estimated from the model.

Raises

ValueError – If unique values from col_target column are not 0 and 1.

Returns

Spark DataFrame with the columns:

percentile, min_prob, max_prob, count_id, events, non_events, cum_events, cum_non_events, precision_at_percentile, recall_at_percentile, event_rate, nonevent_rate, cum_eventrate, cum_noneventrate, ks.

Return type

pyspark.sql.dataframe.DataFrame
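
To make the output columns concrete, here is a minimal plain-Python sketch of a decile-style analysis. It is not the library's Spark implementation. The column definitions are assumptions based on common practice: rows are ranked by descending probability, precision and recall are cumulative up to each bucket, and ks is the absolute gap between the cumulative event and non-event rates. decile_table and its n_buckets parameter are hypothetical names.

```python
from typing import Dict, List, Tuple


def decile_table(rows: List[Tuple[int, float]], n_buckets: int = 10) -> List[Dict[str, float]]:
    """Build a bucketed table from (target, probability) pairs (assumed definitions)."""
    if {target for target, _ in rows} != {0, 1}:
        raise ValueError("target must contain exactly the values 0 and 1.")

    # Rank rows from the highest to the lowest predicted probability.
    ranked = sorted(rows, key=lambda row: row[1], reverse=True)
    total_events = sum(target for target, _ in ranked)
    total_non_events = len(ranked) - total_events

    table = []
    cum_events = cum_non_events = 0
    for b in range(n_buckets):
        # Split the ranked rows into n_buckets contiguous, near-equal slices.
        bucket = ranked[b * len(ranked) // n_buckets:(b + 1) * len(ranked) // n_buckets]
        if not bucket:
            continue
        events = sum(target for target, _ in bucket)
        cum_events += events
        cum_non_events += len(bucket) - events
        cum_eventrate = cum_events / total_events
        cum_noneventrate = cum_non_events / total_non_events
        table.append({
            "percentile": b + 1,
            "min_prob": min(prob for _, prob in bucket),
            "max_prob": max(prob for _, prob in bucket),
            "count_id": len(bucket),
            "events": events,
            "non_events": len(bucket) - events,
            "cum_events": cum_events,
            "cum_non_events": cum_non_events,
            "precision_at_percentile": cum_events / (cum_events + cum_non_events),
            "recall_at_percentile": cum_eventrate,
            "ks": abs(cum_eventrate - cum_noneventrate),
        })
    return table


rows = [(1, 0.95), (1, 0.85), (0, 0.75), (1, 0.65), (0, 0.55),
        (1, 0.45), (0, 0.35), (0, 0.25), (0, 0.15), (0, 0.05)]
table = decile_table(rows, n_buckets=5)
max_ks = max(row["ks"] for row in table)
print(max_ks)
```

Under these assumptions, the maximum of the ks column summarizes how well the model separates events from non-events.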

pyspark_ds_toolbox.ml.classification.eval.get_p1(value)

Module contents

Classification toolbox.

Subpackage dedicated to Classification helpers.