This is a walkthrough of how the package is intended to be used, based on a practical example.

The Dataset

The first thing a forecast needs is data to forecast. SynthCast provides an example of how it expects a dataset to look. The code below loads the package and the example dataset:

library(knitr)
library(SynthCast)
data('df_example')
kable(head(df_example)) 
unit time_period x1 x2 x3 x4 x5 x6 x7 x8 x9 x10 x11 x12 x13 x14 x15 x16 x17 x18 x19 x20 x21 x22 x23 x24 x25 x26 x27 x28
1 1 0.4279268 0.2329316 0.4531898 0.5010649 0.0140657 0.5 0.0103704 0.0126492 0.0061209 0.0016722 0.0020701 0.0229175 0.1717596 0.0028440 0.2961483 0.2777202 0.0179579 0.5 0.0186335 0.0196256 0.0140659 0.5 0.0191083 0.0193874 0.0280014 0.5 0.0062926 0.0193874
1 2 0.3923215 0.0661752 0.4300946 0.4639223 0.1523873 0.5 0.0167901 0.1340623 0.0940312 0.0016722 0.0063536 0.0896040 0.1362349 0.0028440 0.2961483 0.2352990 0.1657939 0.5 0.1428571 0.1479287 0.1589145 0.5 0.1974522 0.1750037 0.1949374 0.5 0.0181592 0.1750037
1 3 0.4420440 0.1649872 0.4336537 0.5034269 0.2919640 0.5 0.0395062 0.2602215 0.1796289 0.0016722 0.0137895 0.1695727 0.1045988 0.0028440 0.2961483 0.2088865 0.3180237 0.5 0.3167702 0.2890312 0.3442300 0.5 0.3949045 0.3201550 0.2198580 0.5 0.0167533 0.3201550
1 4 0.4545717 0.1076923 0.4433019 0.5427364 0.4315704 0.5 0.0501235 0.3791298 0.2685505 0.0016722 0.0172917 0.2420208 0.0822586 0.0028440 0.2961483 0.1556901 0.4694968 0.5 0.4223602 0.4250857 0.5346481 0.5 0.5859873 0.4600435 0.2291281 0.5 0.0072638 0.4600435
1 5 0.4223203 0.1391912 0.4767905 0.5474351 0.5673960 0.5 0.0501235 0.4999604 0.3522328 0.1638796 0.0279551 0.3139178 0.0689121 0.2787148 0.0835851 0.1119981 0.6177005 0.5 0.6149068 0.5627327 0.7247700 0.5 0.7834395 0.5979929 0.2351954 0.5 0.0072638 0.5979929
1 6 0.3827364 0.1078405 0.5021293 0.5456524 0.6992290 0.5 0.0688889 0.6161397 0.4334900 0.3311037 0.0335161 0.3829171 0.0602702 0.2787148 0.0835851 0.0985164 0.7600335 0.5 0.7826087 0.6957559 0.9102858 0.5 0.9745223 0.7413431 0.2458748 0.5 0.0072638 0.7413431

The dataset is expected to have 3 types of columns:

    1. A unit column: contains a numeric identifier of the unit. In the credit card example this could be the customer, a group of customers, etc.;
    2. A time column: contains the time as an integer. In the credit card example this would be the age in months of the respective unit (say 1 for the first month, 2 for the second month, etc.);
    3. Feature columns: numeric features, including both the series to be forecasted and the features used to forecast them. In the credit card example these could be profitability and transactional features.
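As a minimal sketch of this layout, the base R snippet below builds a tiny toy dataset and checks that it has the three kinds of columns. The names `toy_df` and `check_layout()` and all values are hypothetical and not part of SynthCast; only the column roles come from the description above.

```r
# Toy dataset with the expected layout (values are made up for illustration).
toy_df <- data.frame(
  unit        = rep(c(1, 2), times = c(3, 2)),   # numeric unit identifier
  time_period = c(1L, 2L, 3L, 1L, 2L),           # integer age of each unit
  x1          = c(0.43, 0.39, 0.44, 0.41, 0.40), # series to forecast
  x2          = c(0.23, 0.07, 0.16, 0.20, 0.18)  # extra predictor
)

# Hypothetical helper (not a SynthCast function): checks the column roles
# and returns the names of the feature columns.
check_layout <- function(df, col_unit = "unit", col_time = "time_period") {
  feature_cols <- setdiff(names(df), c(col_unit, col_time))
  stopifnot(
    is.numeric(df[[col_unit]]),
    is.numeric(df[[col_time]]),
    all(df[[col_time]] == round(df[[col_time]])),  # whole-number periods
    length(feature_cols) > 0,
    all(vapply(df[feature_cols], is.numeric, logical(1)))
  )
  feature_cols
}

features <- check_layout(toy_df)
print(features)
```

A check like this can be run on your own data before calling the package, so that a character or factor column is caught early.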

The table below shows the maximum time period for a selection of units:

library(dplyr)

df_example %>%
  group_by(unit) %>%
  summarise(max_time_period=max(time_period)) %>%
  filter(unit %in% c(1, 2, 3, 4, 5, 45, 46, 47, 48, 49, 50)) %>% 
  kable()
unit max_time_period
1 50
2 49
3 48
4 47
5 46
45 6
46 5
47 4
48 3
49 2
50 1

As one can see, the older the unit (the smaller the number, the older the unit), the longer the available time series (larger values in the time_period column). This means that data from older units can be used to forecast younger units. For example, the data from units 1 to 18 could be used to predict the next 12 periods of unit 30. This is exactly what the function run_synthetic_forecast() does. To better understand how it works under the hood, it is recommended to read the Synth package paper on the Synthetic Control method.
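The reasoning about which older units can serve as donors can be sketched in base R. The snippet rebuilds only the unit and time structure suggested by the table above (unit u has 51 - u periods, which is an assumption for the units not shown) and applies a simple rule: a donor must have at least as many periods as the unit of interest plus the forecast horizon. This rule is an illustration consistent with the "units 1 to 18" example, not necessarily the exact filter SynthCast applies.

```r
# Rebuild only the unit/time structure: unit u is assumed to have 51 - u periods.
# This is not the real df_example, just its assumed shape.
history <- do.call(rbind, lapply(1:50, function(u) {
  data.frame(unit = u, time_period = seq_len(51 - u))
}))

unit_of_interest    <- 30
periods_to_forecast <- 12

# Last observed period of every unit, as a named integer vector.
max_time <- vapply(split(history$time_period, history$unit), max, integer(1))

# A donor must cover the observed window plus the forecast horizon.
required_length <- max_time[[as.character(unit_of_interest)]] + periods_to_forecast
eligible_units  <- as.integer(names(max_time)[max_time >= required_length])

print(required_length)
print(eligible_units)
```

Under these assumptions unit 30 has 21 observed periods, so donors need 33 periods, which leaves units 1 to 18.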

The function call below runs a synthetic forecast of 12 time periods for the series x1 of unit 30.

synthetic_forecast <- run_synthetic_forecast(
  df = df_example,
  col_unit_name = 'unit',
  col_time = 'time_period',
  periods_to_forecast = 12,
  unit_of_interest = '30',
  serie_of_interest = 'x1'
)
#> [1] "Forecasting Unit:  30 . Serie:  x1"
#> 
#> X1, X0, Z1, Z0 all come directly from dataprep object.
#> 
#> 
#> **************** 
#>  searching for synthetic control unit  
#>  
#> 
#> **************** 
#> **************** 
#> **************** 
#> 
#> MSPE (LOSS V): 0.005105562 
#> 
#> solution.v:
#>  0.03795838 0.02953412 0.03356642 0.01533716 0.1226315 0.1285906 0.05816525 0.02318678 0.01465216 0.01080646 0.06187415 0.0289542 0.01702719 0.08006876 0.009607601 0.01627082 0.1278952 0.02615566 0.01342692 0.04431671 0.04468165 0.01097563 0.04431671 
#> 
#> solution.w:
#>  1.1452e-05 0.0002666085 0.0001873182 0.0002686277 0.0001636778 0.0003347625 0.0004905744 0.0005939929 0.0005203545 0.5502766 4.8739e-06 0.0007708196 0.0003844661 0.001002508 0.000803214 0.000999687 0.1285913 0.3143298

The output of the function is a list with 4 tables.

Synthetic Forecast Results

These are the 4 tables that are returned by the function call.

Table 1: synthetic_control_composition

This table summarizes the results related to the unit selection from the Synthetic Control method. The columns are the following:

kable(synthetic_forecast$synthetic_control_composition)
execution_date projected_unit projected_serie synthetic_units w.weights
2022-03-08 30 x1 10 0.550
2022-03-08 30 x1 18 0.314
2022-03-08 30 x1 17 0.129
2022-03-08 30 x1 8 0.001
2022-03-08 30 x1 9 0.001
2022-03-08 30 x1 12 0.001
2022-03-08 30 x1 14 0.001
2022-03-08 30 x1 15 0.001
2022-03-08 30 x1 16 0.001
  • execution_date: The date on which the forecast was executed, in YYYY-MM-DD format;
  • projected_unit: The forecasted unit;
  • projected_serie: The forecasted series;
  • synthetic_units/w.weights: The units (from 1 to 18) selected and their respective weights.
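In the Synthetic Control method, the synthetic unit is a weighted combination of the control units. The sketch below illustrates that idea with the three largest weights from the table above. The donor series values are invented for illustration, and the exact way SynthCast combines the donors internally is an assumption here.

```r
# Three largest weights from synthetic_control_composition (units 10, 18, 17).
weights <- c(`10` = 0.550, `18` = 0.314, `17` = 0.129)
weights <- weights / sum(weights)  # renormalise after dropping the tiny weights

# Hypothetical donor values of x1: three time periods (rows) by unit (columns).
donors <- cbind(
  `10` = c(0.50, 0.48, 0.52),
  `18` = c(0.40, 0.42, 0.41),
  `17` = c(0.45, 0.47, 0.44)
)

# Synthetic series: weighted sum of the donor series at each time period.
synthetic_x1 <- as.vector(donors[, names(weights)] %*% weights)
print(round(synthetic_x1, 4))
```

Because the weights are non-negative and sum to 1, each synthetic value lies between the smallest and largest donor value for that period.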

Table 2: variable_importance_and_comparison

This table summarizes the results related to the features/variables selection from the Synthetic Control method. The columns are the following:

kable(head(synthetic_forecast$variable_importance_and_comparison,8))
execution_date projected_unit projected_serie variable unit_of_interest synthetic sample v.weights
2022-03-08 30 x1 x8 0.474 0.550 0.567 0.129
2022-03-08 30 x1 x20 0.537 0.613 0.630 0.128
2022-03-08 30 x1 x7 0.110 0.099 0.088 0.123
2022-03-08 30 x1 x16 0.289 0.192 0.168 0.080
2022-03-08 30 x1 x13 0.237 0.145 0.116 0.062
2022-03-08 30 x1 x9 0.443 0.441 0.433 0.058
2022-03-08 30 x1 x25 0.517 0.370 0.317 0.045
2022-03-08 30 x1 x24 0.729 0.709 0.699 0.044
  • execution_date: The date on which the forecast was executed, in YYYY-MM-DD format;
  • projected_unit: The forecasted unit;
  • projected_serie: The forecasted series;
  • variable: The variable selected;
  • unit_of_interest: The mean value over time of the variable named in the variable column, for the unit in projected_unit;
  • synthetic: The mean value over time of the variable named in the variable column, for the synthetic unit;
  • sample: The mean value over time of the variable named in the variable column, for the whole dataset;
  • v.weights: The weight of the variable named in the variable column.
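One way to read this table is to check whether the synthetic unit sits closer to the unit of interest than the plain sample average does. The sketch below does this in base R for the first four rows, copied from the table above. The derived column names (`gap_synthetic`, `gap_sample`, `synthetic_closer`) are hypothetical and not produced by SynthCast.

```r
# First four rows of variable_importance_and_comparison, copied from above.
comparison <- data.frame(
  variable         = c("x8", "x20", "x7", "x16"),
  unit_of_interest = c(0.474, 0.537, 0.110, 0.289),
  synthetic        = c(0.550, 0.613, 0.099, 0.192),
  sample           = c(0.567, 0.630, 0.088, 0.168),
  v.weights        = c(0.129, 0.128, 0.123, 0.080)
)

# Absolute gap to the unit of interest, for the synthetic unit and the sample mean.
comparison$gap_synthetic <- abs(comparison$unit_of_interest - comparison$synthetic)
comparison$gap_sample    <- abs(comparison$unit_of_interest - comparison$sample)

# TRUE where the synthetic unit is closer than the plain sample average.
comparison$synthetic_closer <- comparison$gap_synthetic < comparison$gap_sample
print(comparison[, c("variable", "gap_synthetic", "gap_sample", "synthetic_closer")])
```

For these four variables the synthetic unit is closer to unit 30 than the sample mean in every row.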

Table 3: mape_backtest

This table shows the results of a simple MAPE backtest over the same period that was used to build the forecast. Note that it is not intended as a robust method for validating the model. The Synthetic Control Method is a mathematical optimization approach, not a machine learning one: it minimizes the distance between the curves without guarding against overfitting. The columns are the following:

kable(synthetic_forecast$mape_backtest)
execution_date projected_unit projected_serie max_time_unit_of_interest periods_to_forecast elegible_control_units number_control_units mape
2022-03-08 30 x1 21 12 17 9 13.00928
  • execution_date: The date on which the forecast was executed, in YYYY-MM-DD format;
  • projected_unit: The forecasted unit;
  • projected_serie: The forecasted series;
  • max_time_unit_of_interest: The age of the unit of interest;
  • periods_to_forecast: The number of periods that were forecasted;
  • elegible_control_units: The number of units eligible to be used in the forecast;
  • number_control_units: The number of control units in the synthetic unit (9 here, matching the rows of synthetic_control_composition);
  • mape: The mean absolute percentage error over time periods 1 to max_time_unit_of_interest.
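For reference, the mean absolute percentage error can be sketched in a few lines of base R. The function below uses the common definition expressed in percent. Whether SynthCast computes it in exactly this way is an assumption, and the observed and fitted values are invented for illustration.

```r
# Mean absolute percentage error, in percent (assumed convention).
mape <- function(actual, predicted) {
  stopifnot(length(actual) == length(predicted), all(actual != 0))
  mean(abs((actual - predicted) / actual)) * 100
}

# Invented observed and fitted values for three backtest periods.
observed_values <- c(100, 200, 400)
fitted_values   <- c(110, 180, 400)

backtest_mape <- mape(observed_values, fitted_values)
print(backtest_mape)
```

Here the per-period errors are 10%, 10% and 0%, so the MAPE is about 6.67%.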

Table 4: output_projecao

This table contains the projection itself. The columns are the following:

kable(synthetic_forecast$output_projecao)
execution_date projected_unit time_period projected_serie_value is_projected projected_serie
2022-03-08 30 1 0.4354680 0 x1
2022-03-08 30 2 0.4321821 0 x1
2022-03-08 30 3 0.5256354 0 x1
2022-03-08 30 4 0.4840789 0 x1
2022-03-08 30 5 0.3801790 0 x1
2022-03-08 30 6 0.2640425 0 x1
2022-03-08 30 7 0.1495329 0 x1
2022-03-08 30 8 0.2581808 0 x1
2022-03-08 30 9 0.2937315 0 x1
2022-03-08 30 10 0.3000216 0 x1
2022-03-08 30 11 0.3381660 0 x1
2022-03-08 30 12 0.3035805 0 x1
2022-03-08 30 13 0.2989308 0 x1
2022-03-08 30 14 0.6051545 0 x1
2022-03-08 30 15 0.3462337 0 x1
2022-03-08 30 16 0.3895760 0 x1
2022-03-08 30 17 0.4199159 0 x1
2022-03-08 30 18 0.4777851 0 x1
2022-03-08 30 19 0.5354843 0 x1
2022-03-08 30 20 0.4860005 0 x1
2022-03-08 30 21 0.4963447 0 x1
2022-03-08 30 22 0.4928737 1 x1
2022-03-08 30 23 0.4534551 1 x1
2022-03-08 30 24 0.4750725 1 x1
2022-03-08 30 25 0.4928884 1 x1
2022-03-08 30 26 0.7005200 1 x1
2022-03-08 30 27 0.3911140 1 x1
2022-03-08 30 28 0.4438282 1 x1
2022-03-08 30 29 0.4673172 1 x1
2022-03-08 30 30 0.4722184 1 x1
2022-03-08 30 31 0.4898868 1 x1
2022-03-08 30 32 0.5014260 1 x1
2022-03-08 30 33 0.4313357 1 x1
  • execution_date: The date on which the forecast was executed, in YYYY-MM-DD format;
  • projected_unit: The forecasted unit;
  • time_period: The time period;
  • projected_serie: The forecasted series;
  • projected_serie_value: The value of the series/variable named in the projected_serie column;
  • is_projected: 1 indicates that the value is projected, 0 indicates that the value is observed.
proj <- synthetic_forecast$output_projecao
proj %>% glimpse()
#> Rows: 33
#> Columns: 6
#> $ execution_date        <chr> "2022-03-08", "2022-03-08", "2022-03-08", "2022-…
#> $ projected_unit        <chr> "30", "30", "30", "30", "30", "30", "30", "30", …
#> $ time_period           <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 1…
#> $ projected_serie_value <dbl> 0.4354680, 0.4321821, 0.5256354, 0.4840789, 0.38…
#> $ is_projected          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ projected_serie       <chr> "x1", "x1", "x1", "x1", "x1", "x1", "x1", "x1", …
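A typical next step is to separate the observed history from the forecast using the is_projected flag. The sketch below does this in base R on a few rows shaped like output_projecao, with values copied from the table above. The object names (`proj_toy`, `observed_part`, `forecast_part`) are hypothetical.

```r
# A few rows shaped like output_projecao (values copied from the table above).
proj_toy <- data.frame(
  time_period           = 19:24,
  projected_serie_value = c(0.5354843, 0.4860005, 0.4963447,
                            0.4928737, 0.4534551, 0.4750725),
  is_projected          = c(0, 0, 0, 1, 1, 1)
)

# Split into observed history and forecast using the flag.
observed_part <- proj_toy[proj_toy$is_projected == 0, ]
forecast_part <- proj_toy[proj_toy$is_projected == 1, ]

# The forecast starts right after the last observed period.
last_observed  <- max(observed_part$time_period)
first_forecast <- min(forecast_part$time_period)
print(c(last_observed = last_observed, first_forecast = first_forecast))
```

The same filter applied to the full proj table would give periods 1 to 21 as observed and periods 22 to 33 as the 12 forecasted periods.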