pyspark_ds_toolbox.wrangling package

Submodules

pyspark_ds_toolbox.wrangling.data_quality module

Data Quality toolbox.

Module dedicated to providing data quality tools.

pyspark_ds_toolbox.wrangling.data_quality.count_percent_missing_rows_per_column(sdf: pyspark.sql.dataframe.DataFrame) → pandas.core.frame.DataFrame

Computes the percentage of missing values for each column.

Parameters

sdf (pyspark.sql.dataframe.DataFrame) – A Spark DataFrame.

Returns

A Pandas DataFrame where each row represents a column of sdf and the value is the percentage of null values in that column.

Return type

pd.core.frame.DataFrame
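
Example (a minimal usage sketch; the SparkSession setup and the toy data are illustrative, not part of the module):

   from pyspark.sql import SparkSession
   from pyspark_ds_toolbox.wrangling.data_quality import count_percent_missing_rows_per_column

   spark = SparkSession.builder.getOrCreate()

   # Toy DataFrame: column 'b' is null in 2 of 4 rows (50%).
   sdf = spark.createDataFrame(
       [(1, None), (2, 'x'), (3, None), (4, 'y')],
       schema=['a', 'b'],
   )

   # One row per column of sdf, holding its percentage of null values.
   pdf = count_percent_missing_rows_per_column(sdf)
   print(pdf)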

pyspark_ds_toolbox.wrangling.reshape module

Reshape Tools.

Module dedicated to functions for reshaping datasets.

pyspark_ds_toolbox.wrangling.reshape.pivot_long(dfs: pyspark.sql.dataframe.DataFrame, key_column_name: str, value_column_name: str, key_columns: list, value_columns: list, print_stack_expr: bool = False) → pyspark.sql.dataframe.DataFrame

Function to pivot columns into rows, i.e. an unpivot (similar to pandas melt). See the following link: https://sparkbyexamples.com/pyspark/pyspark-pivot-and-unpivot-dataframe/

Parameters
  • dfs (pyspark.sql.dataframe.DataFrame) – A sparkDF.

  • key_column_name (str) – Name for the column that will receive the names of the columns passed in value_columns as its values.

  • value_column_name (str) – Name of the column that will contain the values from the columns passed in value_columns.

  • key_columns (list) – List of columns that will not be ‘pivoted’.

  • value_columns (list) – List of columns to be ‘pivoted’.

  • print_stack_expr (bool, optional) – Print the spark sql expression that executes the pivot. Defaults to False.

Returns

The dfs argument with the value_columns ‘pivoted’.

Return type

pyspark.sql.dataframe.DataFrame
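
Example (a hedged sketch; the store/month column names and the toy data are made up for illustration):

   from pyspark.sql import SparkSession
   from pyspark_ds_toolbox.wrangling.reshape import pivot_long

   spark = SparkSession.builder.getOrCreate()

   # Wide toy data: one sales column per month.
   sdf = spark.createDataFrame(
       [('store_1', 10, 20), ('store_2', 30, 40)],
       schema=['store', 'jan', 'feb'],
   )

   # 'store' is kept as an identifier; 'jan' and 'feb' are stacked into
   # a 'month' (key) column and a 'sales' (value) column.
   long_sdf = pivot_long(
       dfs=sdf,
       key_column_name='month',
       value_column_name='sales',
       key_columns=['store'],
       value_columns=['jan', 'feb'],
       print_stack_expr=True,  # prints the Spark SQL expression used for the pivot
   )
   long_sdf.show()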

pyspark_ds_toolbox.wrangling.reshape.with_start_week(df: pyspark.sql.dataframe.DataFrame, date_col: str, start_day: str = 'sunday')

Function that adds a date column containing the start of the week.

Parameters
  • df (pyspark.sql.dataframe.DataFrame) – Spark DataFrame.

  • date_col (str) – Name of the column from which the week start date will be computed. Must be one of: pyspark.sql.types.TimestampType, pyspark.sql.types.DateType.

  • start_day (str, optional) – Day on which the week starts. Defaults to ‘sunday’.

Raises

ValueError – If date_col is not of type pyspark.sql.types.TimestampType or pyspark.sql.types.DateType.

Returns

The df argument with an added date column called week.

Return type

pyspark.sql.dataframe.DataFrame
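
Example (a minimal sketch, assuming date_col holds a DateType column; the dates are illustrative):

   import datetime
   from pyspark.sql import SparkSession
   from pyspark_ds_toolbox.wrangling.reshape import with_start_week

   spark = SparkSession.builder.getOrCreate()

   sdf = spark.createDataFrame(
       [(datetime.date(2022, 1, 5),), (datetime.date(2022, 1, 12),)],
       schema=['event_date'],
   )

   # Adds a 'week' date column with the start of each row's week
   # (Sundays here, which is the default).
   out = with_start_week(df=sdf, date_col='event_date', start_day='sunday')
   out.show()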

Module contents

Sub-package dedicated to data wrangling tools.