pyspark_ds_toolbox.stats package

Submodules

pyspark_ds_toolbox.stats.association module

Module Dedicated to Association Metrics

The class implemented in this module is based on the book: Morettin, P.A. and Bussab, W.O., 2017. Estatística básica. Saraiva Educação SA.

class pyspark_ds_toolbox.stats.association.Association

Bases: object

This association class implements different types of association metrics, for both categorical and numerical variables.

The class implemented in this module is based on the book: Morettin, P.A. and Bussab, W.O., 2017. Estatística básica. Saraiva Educação SA.

The current implementation is built on top of Koalas, but in the future this will be changed to pure PySpark.

C(df: pyspark.pandas.frame.DataFrame, columns: List[str], dense: bool = True) → float

Computes the Contingency Coefficient, a non-normalized association metric between two categorical variables.

Parameters

df (pyspark.pandas.frame.DataFrame) – A pandas-on-Spark DataFrame (sparkDF.to_pandas_on_spark()) containing the data in dense form or in grouped form.
columns (List[str]) – List of strings containing the names of the categorical columns used to compute the coefficient.
dense (bool, optional) – If False, df is expected to already be in grouped form, i.e. the result of df.groupby(columns).size().unstack(level=0).fillna(0). Defaults to True.

Returns

The Contingency Coefficient value.

Return type

float
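As a point of reference, the textbook contingency coefficient is C = sqrt(chi2 / (chi2 + n)), where chi2 is the Pearson chi-square statistic of the contingency table and n is the number of observations. The following is a minimal pure-Python sketch of that formula, not the library's code; it assumes the method follows the textbook definition, and the helper names (contingency_table, chi_square, contingency_coefficient) are hypothetical. It also illustrates the difference between dense form (one record per observation) and grouped form (a table of counts).

```python
import math
from collections import Counter


def contingency_table(pairs):
    """Turn raw (a, b) observations (dense form) into a table of counts (grouped form)."""
    counts = Counter(pairs)
    a_levels = sorted({a for a, _ in pairs})
    b_levels = sorted({b for _, b in pairs})
    # Combinations that never occur count as 0, like fillna(0) in the grouped form.
    return [[counts[(a, b)] for b in b_levels] for a in a_levels]


def chi_square(table):
    """Return the Pearson chi-square statistic and the total count n of a table."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    return chi2, n


def contingency_coefficient(table):
    """Textbook contingency coefficient: sqrt(chi2 / (chi2 + n))."""
    chi2, n = chi_square(table)
    return math.sqrt(chi2 / (chi2 + n))


# Two perfectly associated binary variables.
pairs = [("a", "x")] * 10 + [("b", "y")] * 10
table = contingency_table(pairs)
c_value = contingency_coefficient(table)
```

Even for this perfectly associated 2x2 table the value stays below 1, which is what "non-normalized" means here: the upper bound depends on the table's dimensions.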

R2(df: pyspark.pandas.frame.DataFrame, categorical: str, numerical: str) → float

Computes the R2 Metric for one numeric column and one categorical column.

Metric of association between a numerical variable and a categorical variable.

Parameters

df (pyspark.pandas.frame.DataFrame) – A pandas-on-Spark DataFrame (sparkDF.to_pandas_on_spark()).
categorical (str) – Column name of the categorical column.
numerical (str) – Column name of the numerical column.

Returns

The R2 Metric.

Return type

float
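One common textbook definition of this kind of R2 compares the variance of the numerical variable within each category to its overall variance: R2 = 1 - (weighted mean of within-group variances) / (total variance). The sketch below is a minimal pure-Python illustration under that assumption, using population variances; it is not the library's code, and the function names are hypothetical.

```python
from collections import defaultdict


def population_variance(values):
    """Variance with an n denominator."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def r2_categorical_numerical(pairs):
    """R2 = 1 - weighted mean within-group variance / total variance.

    `pairs` is a list of (category, value) observations.
    """
    groups = defaultdict(list)
    for category, value in pairs:
        groups[category].append(value)

    values = [value for _, value in pairs]
    total_variance = population_variance(values)
    if total_variance == 0:
        raise ValueError("The numerical variable is constant; R2 is undefined.")

    # Each group's variance is weighted by the group's size.
    within = sum(len(g) * population_variance(g) for g in groups.values()) / len(values)
    return 1 - within / total_variance


# The category fully determines the value: R2 is 1.
r2_perfect = r2_categorical_numerical([("a", 1.0), ("a", 1.0), ("b", 3.0), ("b", 3.0)])

# Both categories have the same distribution of values: R2 is 0.
r2_none = r2_categorical_numerical([("a", 1.0), ("a", 3.0), ("b", 1.0), ("b", 3.0)])
```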

T(df: pyspark.pandas.frame.DataFrame, columns: List[str], dense: bool = True) → float

Computes the T Metric, a normalized metric of association between two categorical variables.

Parameters

df (pyspark.pandas.frame.DataFrame) – A pandas-on-Spark DataFrame (sparkDF.to_pandas_on_spark()) containing the data in dense form or in grouped form.
columns (List[str]) – List of strings containing the names of the categorical columns used to compute the coefficient.
dense (bool, optional) – If False, df is expected to already be in grouped form, i.e. the result of df.groupby(columns).size().unstack(level=0).fillna(0). Defaults to True.

Returns

The T Metric.

Return type

float
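The name T suggests Tschuprow's T, which rescales the chi-square statistic by the table dimensions: T = sqrt((chi2 / n) / sqrt((r - 1)(s - 1))) for a table with r rows and s columns. That reading is an assumption, and the exact normalization used by this method may differ. The sketch below implements the standard Tschuprow's T in pure Python with hypothetical helper names.

```python
import math


def chi_square(table):
    """Return the Pearson chi-square statistic and the total count n of a table."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    return chi2, n


def tschuprow_t(table):
    """Standard Tschuprow's T for an r x s table of counts (grouped form)."""
    chi2, n = chi_square(table)
    r, s = len(table), len(table[0])
    return math.sqrt((chi2 / n) / math.sqrt((r - 1) * (s - 1)))


# Perfect association reaches 1 for square tables of any size.
t_2x2 = tschuprow_t([[10, 0], [0, 10]])
t_3x3 = tschuprow_t([[4, 0, 0], [0, 4, 0], [0, 0, 4]])

# Independent variables give 0.
t_independent = tschuprow_t([[5, 5], [5, 5]])
```

Unlike the contingency coefficient, this value reaches 1 under perfect association in a square table, which makes values comparable across tables of different sizes.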

association_matrix(df: pyspark.pandas.frame.DataFrame, categorical_features: Optional[List[str]] = None, numerical_features: Optional[List[str]] = None, plot_matrix: bool = True, return_matrix: bool = False) → Union[None, pandas.core.frame.DataFrame]

Computes a normalized association matrix from a df, a list of categorical variables, and a list of numerical variables.

Parameters

df (pyspark.pandas.frame.DataFrame) – A pandas-on-Spark DataFrame (sparkDF.to_pandas_on_spark()).
categorical_features (List[str], optional) – List of column names of the categorical features. Defaults to None.
numerical_features (List[str], optional) – List of column names of the numerical features. Defaults to None.
plot_matrix (bool, optional) – If set to False, the matrix is not plotted. Defaults to True.
return_matrix (bool, optional) – If set to True, the association matrix is returned as a pandas DataFrame. Defaults to False.

Raises

ValueError – If both categorical_features and numerical_features are None.

Returns

Either None, if return_matrix is False, or a pandas DataFrame with the association coefficients.

Return type

Union[None, pd.core.frame.DataFrame]
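A plausible way to read "normalized association matrix" is that each cell is filled with the metric suited to the pair of variable types: T for two categorical features, corr for two numerical features, and R2 for a mixed pair. That mapping is an assumption, not something this page states. The pure-Python sketch below only shows which metric would fill each cell under that assumption, plus the documented ValueError; the function names are hypothetical.

```python
def pick_metric(kind_a, kind_b):
    """Name the pairwise metric for two feature kinds ('cat' or 'num')."""
    if kind_a == "cat" and kind_b == "cat":
        return "T"
    if kind_a == "num" and kind_b == "num":
        return "corr"
    return "R2"


def metric_layout(categorical_features=None, numerical_features=None):
    """Return a nested dict naming the metric assumed for each pair of features."""
    if categorical_features is None and numerical_features is None:
        raise ValueError("Provide categorical_features, numerical_features, or both.")

    kinds = {name: "cat" for name in (categorical_features or [])}
    kinds.update({name: "num" for name in (numerical_features or [])})

    layout = {}
    for a in kinds:
        layout[a] = {}
        for b in kinds:
            # The diagonal pairs a feature with itself, so no metric is chosen.
            layout[a][b] = "-" if a == b else pick_metric(kinds[a], kinds[b])
    return layout


layout = metric_layout(categorical_features=["city"], numerical_features=["age", "income"])
```

Because pick_metric does not depend on argument order, the resulting layout is symmetric.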

corr(df: pyspark.pandas.frame.DataFrame, columns: List[str]) → float

Computes the Correlation Coefficient.

Standard Correlation Coefficient.

Parameters

df (pyspark.pandas.frame.DataFrame) – A pandas-on-Spark DataFrame (sparkDF.to_pandas_on_spark()).
columns (List[str]) – List of numeric column names from which the correlation will be computed.

Returns

Correlation Coefficient.

Return type

float
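Assuming "standard" here means the Pearson correlation coefficient, the quantity is the covariance of the two columns divided by the product of their standard deviations. The following pure-Python sketch illustrates that formula; it is not the library's code, and pearson_corr is a hypothetical name.

```python
import math


def pearson_corr(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # The 1/n factors of the covariance and the standard deviations cancel out.
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


# A perfectly increasing and a perfectly decreasing linear relationship.
corr_up = pearson_corr([1, 2, 3, 4], [2, 4, 6, 8])
corr_down = pearson_corr([1, 2, 3, 4], [8, 6, 4, 2])
```

The sketch does not guard against constant columns, for which the denominator is zero and the coefficient is undefined.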

Module contents

Statistics toolbox.

Subpackage dedicated to statistics helpers.