ds_toolbox package¶
Subpackages¶
Submodules¶
ds_toolbox.statistics module¶
Statistics
This module contains a series of statistical tests built on top of pandas and Spark DataFrames to help with the computation and interpretation of results.
Each function has its own description and, when relevant, the source of the test.
- ds_toolbox.statistics.ab_test(g1: str, g2: str, g1_mean: float, g1_std: float, g1_var: float, g1_count: int, g2_mean: float, g2_std: float, g2_var: float, g2_count: int, confidence=0.95, h0=0) → pandas.core.frame.DataFrame¶
Internal Function. Please refer to ab_test_pairwise.
- Args:
g1 (str): Group 1 identifier.
g2 (str): Group 2 identifier.
g1_mean (float): Group 1 mean of the variable of interest.
g1_std (float): Group 1 standard deviation of the variable of interest.
g1_var (float): Group 1 variance of the variable of interest.
g1_count (int): Group 1 number of observations.
g2_mean (float): Same as Group 1, but for Group 2.
g2_std (float): Same as Group 1, but for Group 2.
g2_var (float): Same as Group 1, but for Group 2.
g2_count (int): Same as Group 1, but for Group 2.
confidence (float, optional): Desired confidence level. Defaults to 0.95.
h0 (int, optional): Null-hypothesis difference between Group 1 and Group 2 in the variable of interest. Defaults to 0.
- Raises:
Exception: Any error will be raised as an exception.
- Returns:
- pd.DataFrame: DataFrame with the columns
Group1
Group2
Group1_{confidence*100}_Percent_CI
Group2_{confidence*100}_Percent_CI
Group1_Minus_Group2_{confidence*100}_Percent_CI
Z_statistic
P_Value
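The statistic behind this test can be sketched with a standard-library two-sample z-test computed from summary statistics; the helper name and the numbers below are illustrative, not part of ds_toolbox:

```python
from math import sqrt
from statistics import NormalDist

def two_sample_z(g1_mean, g1_var, g1_count, g2_mean, g2_var, g2_count,
                 confidence=0.95, h0=0.0):
    """Two-sample z-test from summary statistics (illustrative sketch)."""
    norm = NormalDist()
    se = sqrt(g1_var / g1_count + g2_var / g2_count)   # std. error of the difference
    z = ((g1_mean - g2_mean) - h0) / se                # Z_statistic
    p_value = 2 * (1 - norm.cdf(abs(z)))               # two-sided P_Value
    z_crit = norm.inv_cdf((1 + confidence) / 2)
    diff = g1_mean - g2_mean
    ci = (diff - z_crit * se, diff + z_crit * se)      # Group1_Minus_Group2 CI
    return z, p_value, ci

z, p, ci = two_sample_z(10.2, 4.0, 500, 9.8, 4.1, 480)
```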
- ds_toolbox.statistics.ab_test_pairwise(df: Union[pandas.core.frame.DataFrame, pyspark.sql.dataframe.DataFrame], col_group: str, col_variable: str, confidence: float = 0.95, h0: float = 0) → pandas.core.frame.DataFrame¶
Function that computes a simple AB test (based on mean, std and var) for each pair of values of a categorical column. Works with both pandas and Spark DataFrames.
- Args:
df (Union[pd.DataFrame, pyspark.sql.dataframe.DataFrame]): DataFrame with a group column and a numeric variable column.
col_group (str): Name of the group column. Distinct values of this column will be used as the pairwise comparison groups.
col_variable (str): Name of the numeric variable column whose values are compared between groups.
confidence (float, optional): Desired confidence level. Defaults to 0.95.
h0 (float, optional): Null-hypothesis value. Defaults to 0.
- Raises:
Exception: Any error will be raised as an exception.
- Returns:
- pd.DataFrame: DataFrame with one row per possible pair of values from col_group and the following columns
Group1
Group2
Group1_{confidence*100}_Percent_CI
Group2_{confidence*100}_Percent_CI
Group1_Minus_Group2_{confidence*100}_Percent_CI
Z_statistic
P_Value
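The pairwise mechanics can be sketched in plain Python, assuming a two-sample z-test per pair; the group names and values below are toy data standing in for df[col_group] and df[col_variable]:

```python
from itertools import combinations
from math import sqrt
from statistics import NormalDist, mean, variance

# toy data standing in for the group / variable columns
groups = {
    "A": [10.1, 9.8, 10.4, 10.0, 9.9],
    "B": [9.2, 9.5, 9.1, 9.4, 9.3],
    "C": [10.0, 10.2, 9.9, 10.1, 10.3],
}

rows = []
for g1, g2 in combinations(sorted(groups), 2):
    x, y = groups[g1], groups[g2]
    se = sqrt(variance(x) / len(x) + variance(y) / len(y))
    z = (mean(x) - mean(y)) / se                  # h0 = 0
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    rows.append({"Group1": g1, "Group2": g2, "Z_statistic": z, "P_Value": p})
```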
- ds_toolbox.statistics.contigency_chi2_test(df: pandas.core.frame.DataFrame, col_interest: str, col_groups: str) → Tuple¶
Compute the chi-squared contingency table test, returning a formatted DataFrame.
- Args:
df (pd.DataFrame): DataFrame that contains the columns to build the contingency table.
col_interest (str): Name of the column of interest (its values will be spread into columns to be compared).
col_groups (str): Column with the groups to be compared.
- Returns:
tuple: First element is the scipy object with the test result. Second element is an analytical DataFrame.
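A contingency test of this shape can be reproduced with pandas.crosstab plus scipy.stats.chi2_contingency; a minimal sketch with toy data (column names are illustrative, not part of ds_toolbox):

```python
import pandas as pd
from scipy.stats import chi2_contingency

# toy data: col_interest = "outcome", col_groups = "group"
df = pd.DataFrame({
    "outcome": ["yes", "no", "yes", "yes", "no", "no", "yes", "no"],
    "group":   ["A",   "A",  "A",   "B",   "B",  "B",  "B",   "A"],
})
table = pd.crosstab(df["group"], df["outcome"])   # groups as rows, outcomes as columns
chi2, p, dof, expected = chi2_contingency(table)  # the scipy result object's components
```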
- ds_toolbox.statistics.ks_test(df: Union[pandas.core.frame.DataFrame, pyspark.sql.dataframe.DataFrame], col_target: str, col_probability: str, spark: Optional[pyspark.sql.session.SparkSession] = None, max_mem: int = 2, n_cores: int = 4) → Dict¶
Function to compute a Kolmogorov–Smirnov (KS) test and return a detailed result table. https://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test
- Args:
df (Union[pd.DataFrame, pyspark.sql.dataframe.DataFrame]): DataFrame with probability and classification values.
col_target (str): Name of the column with the classification value.
col_probability (str): Name of the column with probability values.
spark (SparkSession, optional): Spark session to use when df is a Spark DataFrame. Defaults to None.
max_mem (int, optional): Max memory allocated to the local Spark session if df is a pandas DataFrame.
n_cores (int, optional): Number of cores allocated to the local Spark session if df is a pandas DataFrame.
- Raises:
ValueError: If df is a Spark DataFrame and spark is None.
- Returns:
Dict: Dictionary with ‘ks_table’ (table with the results) and ‘max_ks’ (maximum KS value).
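The two-sample KS statistic this function reports can be sketched with scipy.stats.ks_2samp, splitting the probability column by target class; the data below is synthetic:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
# synthetic scores: positives skew high, negatives skew low
pos = rng.beta(4, 2, size=500)   # col_probability where col_target == 1
neg = rng.beta(2, 4, size=500)   # col_probability where col_target == 0

stat, p_value = ks_2samp(pos, neg)
max_ks = float(stat)             # analogous to the 'max_ks' value in the returned dict
```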
- ds_toolbox.statistics.mannwhitney_pairwise(df: pandas.core.frame.DataFrame, col_group: str, col_variable: str, p_value_threshold: float = 0.05) → pandas.core.frame.DataFrame¶
Function to compute a pairwise Mann–Whitney test.
- Args:
df (pd.DataFrame): DataFrame with the value column and the group column used to compute the test.
col_group (str): Name of the column with the groups to be compared.
col_variable (str): Name of the numeric variable column to be compared.
p_value_threshold (float, optional): Threshold against which to compare the p-value. Defaults to 0.05.
- Returns:
- pd.DataFrame: DataFrame with the columns: group1, group2, variable,
mean_variable_group1, mean_variable_group2, mw_pvalue, conclusion, group1_more_profitable.
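A pairwise Mann–Whitney loop of this shape can be sketched with scipy.stats.mannwhitneyu; the data and threshold below are toy values, with column names following the docstring above:

```python
from itertools import combinations
import pandas as pd
from scipy.stats import mannwhitneyu

# toy data: col_group = "group", col_variable = "profit"
df = pd.DataFrame({
    "group":  ["A"] * 6 + ["B"] * 6 + ["C"] * 6,
    "profit": [5, 6, 5, 7, 6, 5, 9, 8, 9, 10, 9, 8, 5, 6, 6, 5, 7, 6],
})

rows = []
for g1, g2 in combinations(df["group"].unique(), 2):
    x = df.loc[df["group"] == g1, "profit"]
    y = df.loc[df["group"] == g2, "profit"]
    stat, p = mannwhitneyu(x, y, alternative="two-sided")
    rows.append({"group1": g1, "group2": g2, "mw_pvalue": p,
                 "conclusion": "different" if p < 0.05 else "same"})
result = pd.DataFrame(rows)
```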
ds_toolbox.utils module¶
Utils
Module that contains utilities to support the other modules functionalities.
- ds_toolbox.utils.start_local_spark(max_mem: int = 3, n_cores: int = 2) → pyspark.sql.session.SparkSession¶
Starts a local Spark session. Used to convert a pandas DataFrame into a Spark DataFrame for computing certain tests and metrics.
- Args:
max_mem (int, optional): Max memory to be allocated. Defaults to 3.
n_cores (int, optional): Number of cores to be allocated. Defaults to 2.
- Raises:
Exception: Any error will be raised as an exception.
- Returns:
pyspark.sql.session.SparkSession: Spark session object.