pyspark_ds_toolbox.stats package
Submodules
pyspark_ds_toolbox.stats.association module
Module Dedicated to Association Metrics
The class implemented in this module is based on the book: Morettin, P.A. and Bussab, W.O., 2017. Estatística básica. Saraiva Educação SA.
- class pyspark_ds_toolbox.stats.association.Association
Bases: object
This association class implements different types of association metrics, for both categorical and numerical variables.
The current implementation is built on top of Koalas, but in the future this will be changed to pure PySpark.
- C(df: pyspark.pandas.frame.DataFrame, columns: List[str], dense: bool = True) -> float
Computes the Contingency Coefficient. A non-normalized association metric between two categorical variables.
- Parameters
df (pyspark.pandas.frame.DataFrame) – A pandas-on-Spark DataFrame (sparkDF.to_pandas_on_spark()) containing the data in dense form or in grouped form.
columns (List[str]) – List of strings containing the names of the categorical columns to be used in the coefficient computation.
dense (bool, optional) – If False, df is expected to already be in grouped form, i.e. the result of df.groupby(columns).size().unstack(level=0).fillna(0). Defaults to True.
- Returns
The Contingency Coefficient value.
- Return type
float
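The following is a minimal sketch of how a contingency coefficient can be computed, assuming the method follows the usual textbook definition C = sqrt(chi2 / (chi2 + n)), where chi2 is the Pearson chi-square statistic of the contingency table and n is the number of observations. It uses plain Python rather than pandas-on-Spark, and the helper names (contingency_table, chi_square, contingency_coefficient) are hypothetical, not part of pyspark_ds_toolbox. The first helper mirrors the difference between the dense form (one row per observation) and the grouped form (a table of counts).

```python
from collections import Counter
from math import sqrt


def contingency_table(rows):
    """Turn dense (a, b) observations into a grouped table of counts."""
    counts = Counter(rows)
    a_levels = sorted({a for a, _ in counts})
    b_levels = sorted({b for _, b in counts})
    return [[counts.get((a, b), 0) for b in b_levels] for a in a_levels]


def chi_square(table):
    """Return (chi2, n) for a table whose row and column totals are all > 0."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    return chi2, n


def contingency_coefficient(table):
    """Assumed definition: C = sqrt(chi2 / (chi2 + n))."""
    chi2, n = chi_square(table)
    return sqrt(chi2 / (chi2 + n))


# Dense form: one (category_a, category_b) pair per observation.
dense_rows = [("a", "x")] * 10 + [("b", "y")] * 10
grouped = contingency_table(dense_rows)  # [[10, 0], [0, 10]]
c_perfect = contingency_coefficient(grouped)
c_independent = contingency_coefficient([[5, 5], [5, 5]])
```

Under this definition, the perfectly associated 2x2 table gives sqrt(0.5), not 1, which illustrates why the metric is described as non-normalized.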
- R2(df: pyspark.pandas.frame.DataFrame, categorical: str, numerical: str) -> float
Computes the R2 Metric for one numeric column and one categorical column.
A metric of association between a numerical variable and a categorical variable.
- Parameters
df (pyspark.pandas.frame.DataFrame) – A pandas-on-Spark DataFrame (sparkDF.to_pandas_on_spark()).
categorical (str) – Name of the categorical column.
numerical (str) – Name of the numerical column.
- Returns
The R2 Metric.
- Return type
float
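As an illustration, the sketch below computes an R2 between a categorical and a numerical variable, assuming the common definition R2 = 1 - (within-group sum of squares) / (total sum of squares), i.e. the share of the numerical variable's variance explained by the categories. This is an assumption about the metric, not a copy of the library's code, and the function name r2_categorical_numerical is hypothetical.

```python
from collections import defaultdict


def r2_categorical_numerical(categories, values):
    """Assumed definition: 1 - within-group SS / total SS.

    Requires at least two distinct values, otherwise total SS is zero.
    """
    n = len(values)
    grand_mean = sum(values) / n
    total_ss = sum((v - grand_mean) ** 2 for v in values)

    # Collect the numerical values observed for each category.
    groups = defaultdict(list)
    for category, value in zip(categories, values):
        groups[category].append(value)

    # Sum of squared deviations from each group's own mean.
    within_ss = 0.0
    for group_values in groups.values():
        group_mean = sum(group_values) / len(group_values)
        within_ss += sum((v - group_mean) ** 2 for v in group_values)

    return 1.0 - within_ss / total_ss


categories = ["a", "a", "b", "b"]
values = [1.0, 3.0, 7.0, 9.0]
r2 = r2_categorical_numerical(categories, values)  # 1 - 4/40, about 0.9
```

A value close to 1 means the group means differ a lot relative to the spread inside each group; a value of 0 means the categories explain none of the variance.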
- T(df: pyspark.pandas.frame.DataFrame, columns: List[str], dense: bool = True) -> float
Computes the T Metric. A normalized metric of association between two categorical variables.
- Parameters
df (pyspark.pandas.frame.DataFrame) – A pandas-on-Spark DataFrame (sparkDF.to_pandas_on_spark()) containing the data in dense form or in grouped form.
columns (List[str]) – List of strings containing the names of the categorical columns to be used in the coefficient computation.
dense (bool, optional) – If False, df is expected to already be in grouped form, i.e. the result of df.groupby(columns).size().unstack(level=0).fillna(0). Defaults to True.
- Returns
The T Metric.
- Return type
float
- association_matrix(df: pyspark.pandas.frame.DataFrame, categorical_features: Optional[List[str]] = None, numerical_features: Optional[List[str]] = None, plot_matrix: bool = True, return_matrix: bool = False) -> Union[None, pandas.core.frame.DataFrame]
Computes a normalized association matrix from a df, a list of categorical variables, and a list of numerical variables.
- Parameters
df (pyspark.pandas.frame.DataFrame) – A pandas-on-Spark DataFrame (sparkDF.to_pandas_on_spark()).
categorical_features (List[str], optional) – List of column names of the categorical features. Defaults to None.
numerical_features (List[str], optional) – List of column names of the numerical features. Defaults to None.
plot_matrix (bool, optional) – If set to False, the matrix is not plotted. Defaults to True.
return_matrix (bool, optional) – If set to True, the correlation matrix is returned as a pandas DataFrame. Defaults to False.
- Raises
ValueError – If both categorical_features and numerical_features are None.
- Returns
None if return_matrix is False; otherwise a pandas DataFrame with the correlation coefficients.
- Return type
Union[None, pandas.core.frame.DataFrame]
- corr(df: pyspark.pandas.frame.DataFrame, columns: List[str]) -> float
Computes the Correlation Coefficient.
Standard Correlation Coefficient.
- Parameters
df (pyspark.pandas.frame.DataFrame) – A pandas-on-Spark DataFrame (sparkDF.to_pandas_on_spark()).
columns (List[str]) – List of numeric column names from which the correlation will be computed.
- Returns
Correlation Coefficient.
- Return type
float
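For reference, the sketch below computes a correlation coefficient between two numeric sequences, assuming that "standard correlation coefficient" refers to Pearson's linear correlation; the source does not name the variant explicitly. It is plain Python, and the function name pearson_corr is hypothetical rather than part of pyspark_ds_toolbox.

```python
from math import sqrt


def pearson_corr(x, y):
    """Pearson correlation of two equal-length numeric sequences.

    Both sequences must have non-zero variance, otherwise the
    denominator is zero.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products of deviations from the means.
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Square roots of the sums of squared deviations.
    dev_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    dev_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cross / (dev_x * dev_y)


increasing = pearson_corr([1, 2, 3, 4], [2, 4, 6, 8])  # close to 1
decreasing = pearson_corr([1, 2, 3, 4], [8, 6, 4, 2])  # close to -1
```

The result lies between -1 and 1, with the sign giving the direction of the linear relationship.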