pyspark_ds_toolbox.causal_inference package
Submodules
pyspark_ds_toolbox.causal_inference.diff_in_diff module
Difference in Difference toolbox.
For an introduction on the subject see: https://mixtape.scunning.com/difference-in-differences.html
- pyspark_ds_toolbox.causal_inference.diff_in_diff.did_estimator(df: pyspark.sql.dataframe.DataFrame, id_col: str, y: str, flag_unit: str, flag_time: str, num_features: Union[None, List[str]] = None, cat_features: Union[None, List[str]] = None) dict
Difference in Difference Estimator.
Implementation based on https://matheusfacure.github.io/python-causality-handbook/14-Difference-in-Difference.html.
- Parameters
df (pyspark.sql.dataframe.DataFrame) – SparkDF from which the causal effect will be estimated.
id_col (str) – Column name of a unique unit identifier.
y (str) – Column name of the outcome of interest.
flag_unit (str) – Column name of a flag indicating whether the unit was treated or not. MUST be 1 or 0.
flag_time (str) – Column name of a flag indicating whether the time is before or after the treatment. MUST be 1 or 0.
num_features (List[str], optional) – List of numeric features to be used. Defaults to None.
cat_features (List[str], optional) – List of categorical features to be used. Defaults to None.
- Raises
ValueError – If id_col, y, flag_unit or flag_time is not in df.columns.
- Returns
- A dictionary with the following keys and values
'impacto_medio': list(linear_model.coefficients)[0],
'n_ids_impactados': df_model.select(id_col).distinct().count(),
'impacto': list(linear_model.coefficients)[0]*df_model.select(id_col).distinct().count(),
'pValueInteraction': linear_model.summary.pValues[0],
'r2': linear_model.summary.r2,
'r2adj': linear_model.summary.r2adj,
'df_with_features': df_model,
'linear_model': linear_model
- Return type
[dict]
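Illustrative usage sketch of did_estimator. The input data and the column names (store_id, sales, treated, post, price, region) are assumptions for the example, not part of the package:

from pyspark.sql import SparkSession
from pyspark_ds_toolbox.causal_inference.diff_in_diff import did_estimator

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet('path/to/panel_data')  # hypothetical panel dataset

did = did_estimator(
    df=df,
    id_col='store_id',        # unique unit identifier
    y='sales',                # outcome of interest
    flag_unit='treated',      # 1 if the unit belongs to the treated group, else 0
    flag_time='post',         # 1 if the observation is after the treatment, else 0
    num_features=['price'],   # optional numeric covariates
    cat_features=['region']   # optional categorical covariates
)

did['impacto_medio']          # average effect per unit (interaction coefficient)
did['pValueInteraction']      # p-value of the interaction term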
pyspark_ds_toolbox.causal_inference.ps_matching module
Propensity Score Matching toolbox.
For an introduction on the subject see: https://mixtape.scunning.com/matching-and-subclassification.html#propensity-score-methods
- pyspark_ds_toolbox.causal_inference.ps_matching.compute_propensity_score(df: pyspark.sql.dataframe.DataFrame, treat: str, y: str, id: str, featuresCol: str = 'features', train_size: float = 0.8) tuple
Computes the propensity score for a given treatment based on the features from df.
- Parameters
df (pyspark.sql.dataframe.DataFrame) – Dataframe with features and treatment assignment.
treat (str) – Column name of the treatment indicator column. Must have values 0 or 1.
y (str) – Column name of the outcome of interest.
id (str) – Column name of the observation id.
featuresCol (str, optional) – Assembled features column to be used in a pyspark pipeline. Defaults to 'features'.
train_size (float, optional) – Proportion of the dataset to be used in training. Defaults to 0.8.
- Returns
- 1. a SparkDF with the id, propensity score, treat and y columns;
2. a PandasDF with the evaluation of the models used to compute the propensity score.
- Return type
tuple
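Illustrative sketch of compute_propensity_score. The column names (treatment, outcome, unit_id) and the input DataFrame df are assumed names for a dataset that already holds the covariates, the 0/1 treatment flag and the outcome:

from pyspark_ds_toolbox.causal_inference.ps_matching import compute_propensity_score

df_ps, df_eval = compute_propensity_score(
    df=df,                   # SparkDF with features and treatment assignment
    treat='treatment',       # 0/1 treatment indicator
    y='outcome',             # outcome of interest
    id='unit_id',            # observation id
    featuresCol='features',  # assembled features column used by the pipeline
    train_size=0.8           # proportion of the data used for training
)

df_ps.show(5)     # id, propensity score, treat and y columns
df_eval.head()    # PandasDF with the evaluation of the fitted models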
- pyspark_ds_toolbox.causal_inference.ps_matching.estimate_causal_effect(df_ps: pyspark.sql.dataframe.DataFrame, y: str, treat: str, ps: str) float
Function that estimates the ATE based on propensity scores.
The implementation is based on Chapter 5 (Matching and Subclassification), Section 5.3 (Approximate Matching), of the book Causal Inference: The Mixtape (https://mixtape.scunning.com/index.html).
- Parameters
df_ps (pyspark.sql.dataframe.DataFrame) – SparkDF with outcome of interest, treatment and propensity scores.
y (str) – Column name of the outcome of interest.
treat (str) – Column name of the flag indicating whether the unit received treatment or not.
ps (str) – Column name with propensity score.
- Returns
The ATE.
- Return type
float
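Illustrative follow-up to the compute_propensity_score sketch above. The propensity score column name 'ps' is an assumption; pass whichever column of df_ps actually holds the scores:

from pyspark_ds_toolbox.causal_inference.ps_matching import estimate_causal_effect

ate = estimate_causal_effect(
    df_ps=df_ps,         # SparkDF with outcome, treatment and propensity scores
    y='outcome',         # outcome of interest
    treat='treatment',   # treatment flag
    ps='ps'              # column holding the propensity score
)
print(ate)               # the estimated ATE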
Module contents
Causal Inference Toolbox
Sub-package dedicated to functionalities related to the field of causal inference. For an introduction on the subject see: https://mixtape.scunning.com/introduction.html#what-is-causal-inference