pyspark_ds_toolbox.causal_inference package

Submodules

pyspark_ds_toolbox.causal_inference.diff_in_diff module

Difference in Difference toolbox.

For an introduction on the subject see: https://mixtape.scunning.com/difference-in-differences.html

pyspark_ds_toolbox.causal_inference.diff_in_diff.did_estimator(df: pyspark.sql.dataframe.DataFrame, id_col: str, y: str, flag_unit: str, flag_time: str, num_features: Union[None, List[str]] = None, cat_features: Union[None, List[str]] = None) → dict

Difference in Difference Estimator.

Implementation based on https://matheusfacure.github.io/python-causality-handbook/14-Difference-in-Difference.html.

Parameters
  • df (pyspark.sql.dataframe.DataFrame) – SparkDF from which the causal effect will be estimated.

  • id_col (str) – Column name of a unique unit identifier.

  • y (str) – Column name of the outcome of interest.

  • flag_unit (str) – Column name of a flag indicating whether the unit was treated or not. Must be 1 or 0.

  • flag_time (str) – Column name of a flag indicating whether the time is before or after the treatment. Must be 1 or 0.

  • num_features (List[str], optional) – List of numeric feature column names to be used. Defaults to None.

  • cat_features (List[str], optional) – List of categorical feature column names to be used. Defaults to None.

Raises

ValueError – If id_col, y, flag_unit or flag_time is not in df.columns.

Returns

A dictionary with the following keys and values:
  • 'impacto_medio' (average impact): list(linear_model.coefficients)[0]

  • 'n_ids_impactados' (number of impacted ids): df_model.select(id_col).distinct().count()

  • 'impacto' (total impact): list(linear_model.coefficients)[0]*df_model.select(id_col).distinct().count()

  • 'pValueInteraction': linear_model.summary.pValues[0]

  • 'r2': linear_model.summary.r2

  • 'r2adj': linear_model.summary.r2adj

  • 'df_with_features': df_model

  • 'linear_model': linear_model

Return type

dict
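Example (a minimal usage sketch; the toy data, column names and values below are illustrative and not part of the library):

    from pyspark.sql import SparkSession

    from pyspark_ds_toolbox.causal_inference.diff_in_diff import did_estimator

    spark = SparkSession.builder.getOrCreate()

    # Four units observed before (flag_time=0) and after (flag_time=1) the intervention;
    # units "a" and "b" are treated (flag_treated=1), "c" and "d" are controls.
    df = spark.createDataFrame(
        [
            ("a", 1, 0, 10.0), ("a", 1, 1, 16.0),
            ("b", 1, 0, 12.0), ("b", 1, 1, 19.0),
            ("c", 0, 0, 9.0),  ("c", 0, 1, 11.0),
            ("d", 0, 0, 11.0), ("d", 0, 1, 12.0),
        ],
        ["unit_id", "flag_treated", "flag_time", "sales"],
    )

    result = did_estimator(
        df=df,
        id_col="unit_id",
        y="sales",
        flag_unit="flag_treated",
        flag_time="flag_time",
    )
    print(result["impacto_medio"])      # estimated average impact per unit
    print(result["n_ids_impactados"])   # number of distinct ids used in the model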

pyspark_ds_toolbox.causal_inference.ps_matching module

Propensity Score Matching toolbox.

For an introduction on the subject see: https://mixtape.scunning.com/matching-and-subclassification.html#propensity-score-methods

pyspark_ds_toolbox.causal_inference.ps_matching.compute_propensity_score(df: pyspark.sql.dataframe.DataFrame, treat: str, y: str, id: str, featuresCol: str = 'features', train_size: float = 0.8) → tuple

Computes the propensity score for a given treatment based on the features from df.

Parameters
  • df (pyspark.sql.dataframe.DataFrame) – Dataframe with features and treatment assignment.

  • treat (str) – Column name of the treatment indicator column. Must have values 0 or 1.

  • y (str) – Column name of the outcome of interest.

  • id (str) – Column name identifying the observations.

  • featuresCol (str, optional) – Name of the assembled features column to be used in a pyspark pipeline. Defaults to 'features'.

  • train_size (float, optional) – Proportion of the dataset to be used in training. Defaults to 0.8.

Returns

A tuple with two elements:

1. a sparkDF with the id, propensity score, treat and y columns;

2. a pandasDF with the evaluation of the models used to compute the propensity score.

Return type

tuple
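Example (a minimal usage sketch; the toy data and column names are illustrative, and it is assumed here that df already carries an assembled 'features' vector column, built below with VectorAssembler):

    from pyspark.sql import SparkSession
    from pyspark.ml.feature import VectorAssembler

    from pyspark_ds_toolbox.causal_inference.ps_matching import compute_propensity_score

    spark = SparkSession.builder.getOrCreate()

    raw = spark.createDataFrame(
        [
            (1, 1, 34.0, 2.0, 120.0), (2, 0, 29.0, 1.0, 80.0),
            (3, 1, 41.0, 3.0, 150.0), (4, 0, 35.0, 2.0, 95.0),
            (5, 1, 38.0, 4.0, 140.0), (6, 0, 27.0, 1.0, 70.0),
            (7, 0, 45.0, 5.0, 160.0), (8, 1, 30.0, 2.0, 100.0),
        ],
        ["customer_id", "treat", "age", "tenure", "spend"],
    )

    # Assemble the raw covariates into the vector column referenced by featuresCol.
    df = VectorAssembler(inputCols=["age", "tenure"], outputCol="features").transform(raw)

    df_ps, df_eval = compute_propensity_score(
        df=df,
        treat="treat",
        y="spend",
        id="customer_id",
        featuresCol="features",
        train_size=0.8,
    )
    df_ps.show()    # sparkDF: id, propensity score, treat and y
    print(df_eval)  # pandasDF: evaluation of the propensity score models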

pyspark_ds_toolbox.causal_inference.ps_matching.estimate_causal_effect(df_ps: pyspark.sql.dataframe.DataFrame, y: str, treat: str, ps: str) → float

Estimates the average treatment effect (ATE) based on propensity scores.

The implementation is based on chapter 5, "Matching and Subclassification", section 5.3, "Approximate Matching", of the book Causal Inference: The Mixtape (https://mixtape.scunning.com/index.html).

Parameters
  • df_ps (pyspark.sql.dataframe.DataFrame) – SparkDF with outcome of interest, treatment and propensity scores.

  • y (str) – Column name of the outcome of interest.

  • treat (str) – Column name of the flag indicating whether the unit received the treatment.

  • ps (str) – Column name of the propensity score.

Returns

The ATE.

Return type

float
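Example (a minimal usage sketch, assuming df_ps already holds the outcome, the treatment flag and a propensity score column, for instance the first element returned by compute_propensity_score; the column names and values are illustrative):

    from pyspark.sql import SparkSession

    from pyspark_ds_toolbox.causal_inference.ps_matching import estimate_causal_effect

    spark = SparkSession.builder.getOrCreate()

    df_ps = spark.createDataFrame(
        [
            (1, 1, 0.72, 110.0), (2, 0, 0.35, 80.0),
            (3, 1, 0.65, 130.0), (4, 0, 0.40, 90.0),
            (5, 1, 0.58, 125.0), (6, 0, 0.30, 75.0),
        ],
        ["customer_id", "treat", "ps", "spend"],
    )

    ate = estimate_causal_effect(df_ps=df_ps, y="spend", treat="treat", ps="ps")
    print(ate)  # estimated average treatment effect (float)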

Module contents

Causal Inference Toolbox

Sub-package dedicated to functionalities related to the field of causal inference. For an introduction on the subject see: https://mixtape.scunning.com/introduction.html#what-is-causal-inference