pyspark_ds_toolbox.wrangling package
Submodules
pyspark_ds_toolbox.wrangling.data_quality module
Data Quality toolbox.
Module dedicated to providing data quality tools.
- pyspark_ds_toolbox.wrangling.data_quality.count_percent_missing_rows_per_column(sdf: pyspark.sql.dataframe.DataFrame) → pandas.core.frame.DataFrame
Computes the percentage of missing values for each column.
- Parameters
sdf (pyspark.sql.dataframe.DataFrame) – A Spark DataFrame.
- Returns
A Pandas DataFrame where each row represents a column of sdf and the value is the percentage of null values in that column.
- Return type
pd.core.frame.DataFrame
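A minimal usage sketch (the SparkSession setup and toy data are illustrative, not part of the package):

    from pyspark.sql import SparkSession
    from pyspark_ds_toolbox.wrangling.data_quality import count_percent_missing_rows_per_column

    spark = SparkSession.builder.getOrCreate()

    # Toy data: 'age' is missing in one of three rows, 'id' has no missing values.
    sdf = spark.createDataFrame(
        [('a', 25), ('b', None), ('c', 40)],
        schema=['id', 'age']
    )

    # Returns a Pandas DataFrame with the share of null values per column of sdf.
    pdf_missing = count_percent_missing_rows_per_column(sdf)
    print(pdf_missing)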
pyspark_ds_toolbox.wrangling.reshape module
Reshape Tools.
Module dedicated to functions for reshaping datasets.
- pyspark_ds_toolbox.wrangling.reshape.pivot_long(dfs: pyspark.sql.dataframe.DataFrame, key_column_name: str, value_column_name: str, key_columns: list, value_columns: list, print_stack_expr: bool = False) → pyspark.sql.dataframe.DataFrame
Function to pivot columns into rows, i.e. reshape the data from wide to long (analogous to pandas melt). See the following link: https://sparkbyexamples.com/pyspark/pyspark-pivot-and-unpivot-dataframe/
- Parameters
dfs (pyspark.sql.dataframe.DataFrame) – A Spark DataFrame.
key_column_name (str) – Name for the column that will receive the columns passed in value_columns as values.
value_column_name (str) – Name for the column that will contain the values from the columns passed in value_columns.
key_columns (list) – List of columns that will not be ‘pivoted’.
value_columns (list) – List of columns to be ‘pivoted’.
print_stack_expr (bool, optional) – Print the spark sql expression that executes the pivot. Defaults to False.
- Returns
The dfs argument with the value_columns ‘pivoted’.
- Return type
[pyspark.sql.dataframe.DataFrame]
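A minimal usage sketch of the wide-to-long reshape (the toy data and column names are illustrative, not part of the package):

    from pyspark.sql import SparkSession
    from pyspark_ds_toolbox.wrangling.reshape import pivot_long

    spark = SparkSession.builder.getOrCreate()

    # Toy wide table: one row per id, one column per month.
    sdf_wide = spark.createDataFrame(
        [(1, 10.0, 12.0), (2, 7.0, 8.5)],
        schema=['id', 'jan', 'feb']
    )

    sdf_long = pivot_long(
        dfs=sdf_wide,
        key_column_name='month',       # receives the former column names ('jan', 'feb')
        value_column_name='value',     # receives the corresponding values
        key_columns=['id'],            # identifier columns kept as-is
        value_columns=['jan', 'feb'],  # columns to be 'pivoted' into rows
        print_stack_expr=True          # print the generated stack() SQL expression
    )
    sdf_long.show()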
- pyspark_ds_toolbox.wrangling.reshape.with_start_week(df: pyspark.sql.dataframe.DataFrame, date_col: str, start_day: str = 'sunday')
Function that adds a date column containing the start of the week.
- Parameters
df (pyspark.sql.dataframe.DataFrame) – Spark DataFrame.
date_col (str) – Name of the column from which the week start date will be computed. Must be of type pyspark.sql.types.TimestampType or pyspark.sql.types.DateType.
start_day (str, optional) – Day on which the week should start. Defaults to ‘sunday’.
- Raises
ValueError – If date_col is not of type pyspark.sql.types.TimestampType or pyspark.sql.types.DateType.
- Returns
The df argument with an added date column called week.
- Return type
[pyspark.sql.dataframe.DataFrame]
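A minimal usage sketch (the toy events table is illustrative, not part of the package):

    import datetime
    from pyspark.sql import SparkSession
    from pyspark_ds_toolbox.wrangling.reshape import with_start_week

    spark = SparkSession.builder.getOrCreate()

    # Toy table with a DateType column.
    sdf = spark.createDataFrame(
        [(1, datetime.date(2022, 3, 15)), (2, datetime.date(2022, 3, 19))],
        schema=['id', 'event_date']
    )

    # Adds a date column named 'week' with the week start for each event_date.
    sdf_week = with_start_week(df=sdf, date_col='event_date', start_day='sunday')
    sdf_week.show()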
Module contents
Sub-package dedicated to data wrangling tools.