pyspark_ds_toolbox.causal_inference package

Submodules

pyspark_ds_toolbox.causal_inference.diff_in_diff module

Difference in Difference toolbox.

For an introduction on the subject see: https://mixtape.scunning.com/difference-in-differences.html

pyspark_ds_toolbox.causal_inference.diff_in_diff.did_estimator(df: pyspark.sql.dataframe.DataFrame, id_col: str, y: str, flag_unit: str, flag_time: str, num_features: Union[None, List[str]] = None, cat_features: Union[None, List[str]] = None) → dict

Difference in Difference Estimator.

Implementation based on https://matheusfacure.github.io/python-causality-handbook/14-Difference-in-Difference.html.

Parameters
  • df (pyspark.sql.dataframe.DataFrame) – SparkDF from which the causal effect will be estimated.

  • id_col (str) – Column name of a unique unit identifier.

  • y (str) – Column name of the outcome of interest.

  • flag_unit (str) – Column name of a flag indicating whether the unit was treated or not. Must be 1 or 0.

  • flag_time (str) – Column name of a flag indicating whether the time is before or after the treatment. Must be 1 or 0.

  • num_features (List[str], optional) – List of numeric feature column names to be used. Defaults to None.

  • cat_features (List[str], optional) – List of categorical feature column names to be used. Defaults to None.

Raises

ValueError – If id_col, y, flag_unit or flag_time is not in df.columns.

Returns

A dictionary with the following keys and values:
  • 'impacto_medio' (average impact): list(linear_model.coefficients)[0]

  • 'n_ids_impactados' (number of impacted ids): df_model.select(id_col).distinct().count()

  • 'impacto' (total impact): list(linear_model.coefficients)[0]*df_model.select(id_col).distinct().count()

  • 'pValueInteraction': linear_model.summary.pValues[0]

  • 'r2': linear_model.summary.r2

  • 'r2adj': linear_model.summary.r2adj

  • 'df_with_features': df_model

  • 'linear_model': linear_model

Return type

dict
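Example (a minimal usage sketch; the toy data, column names and values below are illustrative and not part of the library):

    from pyspark.sql import SparkSession

    from pyspark_ds_toolbox.causal_inference.diff_in_diff import did_estimator

    spark = SparkSession.builder.getOrCreate()

    # Four units observed before (flag_time=0) and after (flag_time=1) the intervention;
    # units "a" and "b" are treated (flag_treated=1), "c" and "d" are controls.
    df = spark.createDataFrame(
        [
            ("a", 1, 0, 10.0), ("a", 1, 1, 16.0),
            ("b", 1, 0, 12.0), ("b", 1, 1, 19.0),
            ("c", 0, 0, 9.0),  ("c", 0, 1, 11.0),
            ("d", 0, 0, 11.0), ("d", 0, 1, 12.0),
        ],
        ["unit_id", "flag_treated", "flag_time", "sales"],
    )

    result = did_estimator(
        df=df,
        id_col="unit_id",
        y="sales",
        flag_unit="flag_treated",
        flag_time="flag_time",
    )
    print(result["impacto_medio"])      # estimated average impact per unit
    print(result["n_ids_impactados"])   # number of distinct ids used in the model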

pyspark_ds_toolbox.causal_inference.ps_matching module

Propensity Score Matching toolbox.

For an introduction on the subject see: https://mixtape.scunning.com/matching-and-subclassification.html#propensity-score-methods

pyspark_ds_toolbox.causal_inference.ps_matching.compute_propensity_score(df: pyspark.sql.dataframe.DataFrame, treat: str, y: str, id: str, featuresCol: str = 'features', train_size: float = 0.8) → tuple

Computes the propensity score for a given treatment based on the features from df.

Parameters
  • df (pyspark.sql.dataframe.DataFrame) – Dataframe with features and treatment assignment.

  • treat (str) – Column name of the treatment indicator column. Must have values 0 or 1.

  • y (str) – Column name of the outcome of interest.

  • id (str) – Column name identifying the observations.

  • featuresCol (str, optional) – Name of the assembled features column to be used in a pyspark pipeline. Defaults to 'features'.

  • train_size (float, optional) – Proportion of the dataset to be used in training. Defaults to 0.8.

Returns

A tuple with two elements:

1. a sparkDF with the id, propensity score, treat and y columns;

2. a pandasDF with the evaluation of the models used to compute the propensity score.

Return type

tuple
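Example (a minimal usage sketch; the toy data and column names are illustrative, and it is assumed here that df already carries an assembled 'features' vector column, built below with VectorAssembler):

    from pyspark.sql import SparkSession
    from pyspark.ml.feature import VectorAssembler

    from pyspark_ds_toolbox.causal_inference.ps_matching import compute_propensity_score

    spark = SparkSession.builder.getOrCreate()

    raw = spark.createDataFrame(
        [
            (1, 1, 34.0, 2.0, 120.0), (2, 0, 29.0, 1.0, 80.0),
            (3, 1, 41.0, 3.0, 150.0), (4, 0, 35.0, 2.0, 95.0),
            (5, 1, 38.0, 4.0, 140.0), (6, 0, 27.0, 1.0, 70.0),
            (7, 0, 45.0, 5.0, 160.0), (8, 1, 30.0, 2.0, 100.0),
        ],
        ["customer_id", "treat", "age", "tenure", "spend"],
    )

    # Assemble the raw covariates into the vector column referenced by featuresCol.
    df = VectorAssembler(inputCols=["age", "tenure"], outputCol="features").transform(raw)

    df_ps, df_eval = compute_propensity_score(
        df=df,
        treat="treat",
        y="spend",
        id="customer_id",
        featuresCol="features",
        train_size=0.8,
    )
    df_ps.show()    # sparkDF: id, propensity score, treat and y
    print(df_eval)  # pandasDF: evaluation of the propensity score models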

pyspark_ds_toolbox.causal_inference.ps_matching.estimate_causal_effect(df_ps: pyspark.sql.dataframe.DataFrame, y: str, treat: str, ps: str) → float

Estimates the average treatment effect (ATE) based on propensity scores.

The implementation is based on chapter 5, "Matching and Subclassification", section 5.3, "Approximate Matching", of the book Causal Inference: The Mixtape (https://mixtape.scunning.com/index.html).

Parameters
  • df_ps (pyspark.sql.dataframe.DataFrame) – SparkDF with outcome of interest, treatment and propensity scores.

  • y (str) – Column name of the outcome of interest.

  • treat (str) – Column name of the flag indicating whether the unit received the treatment.

  • ps (str) – Column name of the propensity score.

Returns

The ATE.

Return type

float
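Example (a minimal usage sketch, assuming df_ps already holds the outcome, the treatment flag and a propensity score column, for instance the first element returned by compute_propensity_score; the column names and values are illustrative):

    from pyspark.sql import SparkSession

    from pyspark_ds_toolbox.causal_inference.ps_matching import estimate_causal_effect

    spark = SparkSession.builder.getOrCreate()

    df_ps = spark.createDataFrame(
        [
            (1, 1, 0.72, 110.0), (2, 0, 0.35, 80.0),
            (3, 1, 0.65, 130.0), (4, 0, 0.40, 90.0),
            (5, 1, 0.58, 125.0), (6, 0, 0.30, 75.0),
        ],
        ["customer_id", "treat", "ps", "spend"],
    )

    ate = estimate_causal_effect(df_ps=df_ps, y="spend", treat="treat", ps="ps")
    print(ate)  # estimated average treatment effect (float)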

Module contents

Causal Inference Toolbox

Sub-package dedicated to functionalities related to the field of causal inference. For an introduction on the subject see: https://mixtape.scunning.com/introduction.html#what-is-causal-inference