pyspark_ds_toolbox.causal_inference package
Submodules
pyspark_ds_toolbox.causal_inference.diff_in_diff module
Difference in Difference toolbox.
For an introduction on the subject see: https://mixtape.scunning.com/difference-in-differences.html
- pyspark_ds_toolbox.causal_inference.diff_in_diff.did_estimator(df: pyspark.sql.dataframe.DataFrame, id_col: str, y: str, flag_unit: str, flag_time: str, num_features: Union[None, List[str]] = None, cat_features: Union[None, List[str]] = None) → dict
- Difference in Difference Estimator. Implementation based on https://matheusfacure.github.io/python-causality-handbook/14-Difference-in-Difference.html.
- Parameters
- df (pyspark.sql.dataframe.DataFrame) – SparkDF from which the causal effect will be estimated. 
- id_col (str) – Column name of an unique unit identifier. 
- y (str) – Column name of the outcome of interest. 
- flag_unit (str) – Column name of a flag indicating whether the unit was treated or not. MUST be 1 or 0. 
- flag_time (str) – Column name of a flag indicating whether the time is before or after the treatment. MUST be 1 or 0. 
- num_features (Union[None, List[str]], optional) – List of numeric features to be used. Defaults to None. 
- cat_features (Union[None, List[str]], optional) – List of categorical features to be used. Defaults to None. 
 
- Raises
- ValueError – If id_col, y, flag_unit or flag_time is not in df.columns. 
- Returns
- A dictionary with the following keys and values:
- 'impacto_medio': list(linear_model.coefficients)[0], 
- 'n_ids_impactados': df_model.select(id_col).distinct().count(), 
- 'impacto': list(linear_model.coefficients)[0]*df_model.select(id_col).distinct().count(), 
- 'pValueInteraction': linear_model.summary.pValues[0], 
- 'r2': linear_model.summary.r2, 
- 'r2adj': linear_model.summary.r2adj, 
- 'df_with_features': df_model, 
- 'linear_model': linear_model 
 
 
- Return type
- [dict] 
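The 'impacto_medio' value above is the coefficient of the treatment-time interaction, which in the canonical 2x2 setting equals the double difference of group means. A minimal pure-Python sketch of that arithmetic (the data below is illustrative, not from the library):

```python
# Illustrative 2x2 difference-in-differences contrast. The interaction
# coefficient of y ~ flag_unit + flag_time + flag_unit*flag_time equals
# this double difference of group means.

def did_2x2(treated_pre, treated_post, control_pre, control_post):
    """Treatment effect via the classic 2x2 DiD contrast."""
    mean = lambda xs: sum(xs) / len(xs)
    return (mean(treated_post) - mean(treated_pre)) - (
        mean(control_post) - mean(control_pre)
    )

effect = did_2x2(
    treated_pre=[10.0, 12.0],   # treated units, flag_time = 0
    treated_post=[15.0, 17.0],  # treated units, flag_time = 1
    control_pre=[9.0, 11.0],    # control units, flag_time = 0
    control_post=[10.0, 12.0],  # control units, flag_time = 1
)
print(effect)  # treated change (+5) minus control trend (+1) -> 4.0
```

Multiplying this per-unit effect by the number of distinct treated ids is what the 'impacto' key reports.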
 
pyspark_ds_toolbox.causal_inference.ps_matching module
Propensity Score Matching toolbox.
For an introduction on the subject see: https://mixtape.scunning.com/matching-and-subclassification.html#propensity-score-methods
- pyspark_ds_toolbox.causal_inference.ps_matching.compute_propensity_score(df: pyspark.sql.dataframe.DataFrame, treat: str, y: str, id: str, featuresCol: str = 'features', train_size: float = 0.8) → tuple
- Computes the propensity score for a given treatment based on the features from df.
- Parameters
- df (pyspark.sql.dataframe.DataFrame) – Dataframe with features and treatment assignment. 
- treat (str) – Column name of the treatment indicator column. Must have values 0 or 1. 
- y (str) – Column name of the outcome of interest. 
- id (str) – Column name of the observation id. 
- featuresCol (str, optional) – Name of the assembled features column to be used in a pyspark Pipeline. Defaults to 'features'. 
- train_size (float, optional) – Proportion of the dataset to be used in training. Defaults to 0.8. 
 
- Returns
- 1. A SparkDF with the id, propensity score, treat and y columns;
- 2. A pandasDF with the evaluation of the models used to compute the propensity score. 
 
- Return type
- tuple 
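A propensity score is simply the modeled probability P(treat = 1 | features). As a language-agnostic illustration of what the returned scores represent (a hand-set logistic model with made-up coefficients, not the pipeline this function actually fits):

```python
import math

def propensity(x, coefs, intercept):
    """Propensity score: P(treat = 1 | x) under a logistic model."""
    z = intercept + sum(c * xi for c, xi in zip(coefs, x))
    return 1.0 / (1.0 + math.exp(-z))

# Two units with different covariates receive different probabilities
# of treatment; these scores are what units get matched/weighted on.
ps_a = propensity([1.0, 0.0], coefs=[2.0, -1.0], intercept=-1.0)  # z = 1.0
ps_b = propensity([0.0, 1.0], coefs=[2.0, -1.0], intercept=-1.0)  # z = -2.0
print(round(ps_a, 3), round(ps_b, 3))
```

Units with similar scores are comparable in their likelihood of treatment, which is what makes the downstream causal comparison credible.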
 
- pyspark_ds_toolbox.causal_inference.ps_matching.estimate_causal_effect(df_ps: pyspark.sql.dataframe.DataFrame, y: str, treat: str, ps: str) → float
- Function that estimates the ATE based on propensity scores. The implementation is based on chapter 5 (Matching and Subclassification), section 5.3 (Approximate Matching) of the book Causal Inference: The Mixtape (https://mixtape.scunning.com/index.html).
- Parameters
- df_ps (pyspark.sql.dataframe.DataFrame) – SparkDF with outcome of interest, treatment and propensity scores. 
- y (str) – Column name of the outcome of interest. 
- treat (str) – Column name indicating whether the unit received the treatment or not. 
- ps (str) – Column name with propensity score. 
 
- Returns
- The ATE. 
- Return type
- float 
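One standard way to turn propensity scores into an ATE, covered in the same Mixtape chapter, is inverse-probability weighting. The sketch below is an illustrative pure-Python version of that estimator with toy data; consult the function's source for the exact estimator it implements:

```python
def ipw_ate(y, treat, ps):
    """Inverse-probability-weighting estimate of the ATE:
    mean of d*y/ps - (1-d)*y/(1-ps) over all units."""
    n = len(y)
    return sum(
        d * yi / p - (1 - d) * yi / (1 - p)
        for yi, d, p in zip(y, treat, ps)
    ) / n

# Toy data: treated outcomes are 2 units higher at every score level.
y     = [3.0, 5.0, 1.0, 3.0]
treat = [1,   1,   0,   0]
ps    = [0.5, 0.5, 0.5, 0.5]
print(ipw_ate(y, treat, ps))  # (3/0.5 + 5/0.5 - 1/0.5 - 3/0.5) / 4 = 2.0
```

Weighting each unit by the inverse of its (estimated) probability of landing in its observed group rebalances the two groups so their outcome difference estimates the ATE.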
 
Module contents
Causal Inference Toolbox
Sub-package dedicated to functionalities related to the field of causal inference. For an introduction on the subject see: https://mixtape.scunning.com/introduction.html#what-is-causal-inference