Pyspark DS Toolbox


The objective of the package is to provide a set of tools that help with the daily data science work in Spark. The documentation can be found here and notebooks with usage examples here.

Feel free to contribute :)

Installation

Directly from PyPi:

pip install pyspark-ds-toolbox

or from GitHub (note that installing from GitHub will install the latest development version):

pip install git+https://github.com/viniciusmsousa/pyspark-ds-toolbox.git
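To confirm that the install succeeded, the installed version can be queried through the standard library. A minimal sketch; the distribution name is the one used in the pip command above:

from importlib.metadata import version

# Prints the installed version of the distribution, e.g. after a pip install.
print(version("pyspark-ds-toolbox"))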

Organization

The package is organized into a structure based on the nature of the task, such as data wrangling, model/prediction evaluation, and so on. The directory tree below maps directly to import paths, as shown in the example that follows it.

pyspark_ds_toolbox         # Main Package
├─ causal_inference           # Sub-package dedicated to Causal Inference
│  ├─ diff_in_diff.py   
│  └─ ps_matching.py    
├─ ml                         # Sub-package dedicated to ML
│  ├─ data_prep                  # Sub-package with ML data preparation tools
│  │  ├─ class_weights.py     
│  │  └─ features_vector.py 
│  ├─ classification             # Sub-package dedicated to classification tasks
│  │  ├─ eval.py
│  │  └─ baseline_classifiers.py 
│  ├─ feature_importance         # Sub-package with feature importance tools
│  │  ├─ native_spark.py
│  │  └─ shap_values.py 
│  └─ feature_selection         # Sub-package with feature selection tools
│     └─ information_value.py    
├─ wrangling                  # Sub-package dedicated to data wrangling tasks
│  ├─ reshape.py               
│  └─ data_quality.py         
└─ stats                      # Sub-package dedicated to basic statistics functionality
   └─ association.py
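Each .py file in the tree is importable as a module under pyspark_ds_toolbox. A minimal sketch, assuming only the module names shown in the tree (individual function names are not listed here and would be assumptions):

# Import modules following the sub-package structure shown above.
from pyspark_ds_toolbox.ml.data_prep import class_weights, features_vector
from pyspark_ds_toolbox.wrangling import data_quality, reshape
from pyspark_ds_toolbox.stats import association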