how-to-run-synthetic-forecast.Rmd
This is a walkthrough of how the package is intended to be used, with a practical example.
The first thing a forecast needs is data to be forecasted. SynthCast provides an example of how it expects a dataset to look. The code below loads the package and the example dataset:
unit | time_period | x1 | x2 | x3 | x4 | x5 | x6 | x7 | x8 | x9 | x10 | x11 | x12 | x13 | x14 | x15 | x16 | x17 | x18 | x19 | x20 | x21 | x22 | x23 | x24 | x25 | x26 | x27 | x28 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 1 | 0.4279268 | 0.2329316 | 0.4531898 | 0.5010649 | 0.0140657 | 0.5 | 0.0103704 | 0.0126492 | 0.0061209 | 0.0016722 | 0.0020701 | 0.0229175 | 0.1717596 | 0.0028440 | 0.2961483 | 0.2777202 | 0.0179579 | 0.5 | 0.0186335 | 0.0196256 | 0.0140659 | 0.5 | 0.0191083 | 0.0193874 | 0.0280014 | 0.5 | 0.0062926 | 0.0193874 |
1 | 2 | 0.3923215 | 0.0661752 | 0.4300946 | 0.4639223 | 0.1523873 | 0.5 | 0.0167901 | 0.1340623 | 0.0940312 | 0.0016722 | 0.0063536 | 0.0896040 | 0.1362349 | 0.0028440 | 0.2961483 | 0.2352990 | 0.1657939 | 0.5 | 0.1428571 | 0.1479287 | 0.1589145 | 0.5 | 0.1974522 | 0.1750037 | 0.1949374 | 0.5 | 0.0181592 | 0.1750037 |
1 | 3 | 0.4420440 | 0.1649872 | 0.4336537 | 0.5034269 | 0.2919640 | 0.5 | 0.0395062 | 0.2602215 | 0.1796289 | 0.0016722 | 0.0137895 | 0.1695727 | 0.1045988 | 0.0028440 | 0.2961483 | 0.2088865 | 0.3180237 | 0.5 | 0.3167702 | 0.2890312 | 0.3442300 | 0.5 | 0.3949045 | 0.3201550 | 0.2198580 | 0.5 | 0.0167533 | 0.3201550 |
1 | 4 | 0.4545717 | 0.1076923 | 0.4433019 | 0.5427364 | 0.4315704 | 0.5 | 0.0501235 | 0.3791298 | 0.2685505 | 0.0016722 | 0.0172917 | 0.2420208 | 0.0822586 | 0.0028440 | 0.2961483 | 0.1556901 | 0.4694968 | 0.5 | 0.4223602 | 0.4250857 | 0.5346481 | 0.5 | 0.5859873 | 0.4600435 | 0.2291281 | 0.5 | 0.0072638 | 0.4600435 |
1 | 5 | 0.4223203 | 0.1391912 | 0.4767905 | 0.5474351 | 0.5673960 | 0.5 | 0.0501235 | 0.4999604 | 0.3522328 | 0.1638796 | 0.0279551 | 0.3139178 | 0.0689121 | 0.2787148 | 0.0835851 | 0.1119981 | 0.6177005 | 0.5 | 0.6149068 | 0.5627327 | 0.7247700 | 0.5 | 0.7834395 | 0.5979929 | 0.2351954 | 0.5 | 0.0072638 | 0.5979929 |
1 | 6 | 0.3827364 | 0.1078405 | 0.5021293 | 0.5456524 | 0.6992290 | 0.5 | 0.0688889 | 0.6161397 | 0.4334900 | 0.3311037 | 0.0335161 | 0.3829171 | 0.0602702 | 0.2787148 | 0.0835851 | 0.0985164 | 0.7600335 | 0.5 | 0.7826087 | 0.6957559 | 0.9102858 | 0.5 | 0.9745223 | 0.7413431 | 0.2458748 | 0.5 | 0.0072638 | 0.7413431 |
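The dataset is expected to have 3 types of columns: a column that identifies the unit (unit), a column that identifies the time period (time_period), and the columns with the series/variables themselves (x1 to x28).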
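As a rough illustration of this layout, the sketch below builds a small toy data frame in the same long format and runs a few sanity checks on it. The helper check_layout() is hypothetical and not part of SynthCast, and the checks themselves (numeric series, one row per unit and time period) are assumptions about what a well-formed input looks like, not documented requirements of the package.

```r
# Hypothetical toy data in the same long layout as df_example:
# one row per (unit, time_period), plus numeric series columns.
set.seed(42)
toy <- expand.grid(time_period = 1:4, unit = 1:3)
toy <- toy[, c("unit", "time_period")]
toy$x1 <- runif(nrow(toy))
toy$x2 <- runif(nrow(toy))

# check_layout() is an illustrative helper, NOT a SynthCast function.
# It returns the names of the series columns if the checks pass.
check_layout <- function(df, col_unit = "unit", col_time = "time_period") {
  series_cols <- setdiff(names(df), c(col_unit, col_time))
  stopifnot(
    all(c(col_unit, col_time) %in% names(df)),             # id columns exist
    length(series_cols) > 0,                               # at least one series
    all(vapply(df[series_cols], is.numeric, logical(1))),  # series are numeric
    !any(duplicated(df[c(col_unit, col_time)]))            # one row per unit/time
  )
  series_cols
}

check_layout(toy)
```

Running a check like this before forecasting can catch duplicated rows or non-numeric series early.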
The table below shows the maximum time period for each unit:
library(dplyr)
library(knitr)  # provides kable()

# Longest observed time period for a sample of units
df_example %>%
  group_by(unit) %>%
  summarise(max_time_period = max(time_period)) %>%
  filter(unit %in% c(1, 2, 3, 4, 5, 45, 46, 47, 48, 49, 50)) %>%
  kable()
unit | max_time_period |
---|---|
1 | 50 |
2 | 49 |
3 | 48 |
4 | 47 |
5 | 46 |
45 | 6 |
46 | 5 |
47 | 4 |
48 | 3 |
49 | 2 |
50 | 1 |
As one can see, the older the unit (the smaller the number, the older the unit), the longer the available time series (larger values in the time_period column). This means that data from older units can be used to forecast younger units. For example, the data from units 1 to 18 could be used to predict the next 12 periods of unit 30. This is exactly what the function run_synthetic_forecast() does. (To better understand how it works under the hood, it is recommended to check the Synthetic Control Synth package paper.)
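The reasoning above can be sketched in a few lines. The toy data below only reproduces the age structure of the example (unit u has 51 - u periods), and the eligibility rule is an assumption inferred from the text: a donor unit needs at least as many periods as the unit of interest has observed plus the forecast horizon. The package itself may apply additional filters.

```r
# Toy version of the age structure shown above: unit u has 51 - u periods.
# (Assumption: this mirrors df_example, where unit 1 has 50 periods and
# unit 50 has 1.)
toy <- do.call(rbind, lapply(1:50, function(u) {
  data.frame(unit = u, time_period = seq_len(51 - u))
}))

# Longest observed period per unit (names are the unit ids as strings).
max_time <- tapply(toy$time_period, toy$unit, max)

unit_of_interest    <- "30"
periods_to_forecast <- 12

# Assumed rule: a donor must cover the target's history plus the horizon.
needed <- max_time[[unit_of_interest]] + periods_to_forecast
donors <- as.integer(names(max_time)[max_time >= needed])

needed         # 21 observed periods + 12 to forecast = 33
range(donors)  # units 1 to 18
```

Under this rule, unit 30 (21 observed periods) needs donors with at least 33 periods, which are units 1 to 18.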
The function call below runs a synthetic forecast of 12 time periods of the series x1 of unit 30.
synthetic_forecast <- run_synthetic_forecast(
  df = df_example,
  col_unit_name = 'unit',
  col_time = 'time_period',
  periods_to_forecast = 12,
  unit_of_interest = '30',
  serie_of_interest = 'x1'
)
#> [1] "Forecasting Unit: 30 . Serie: x1"
#>
#> X1, X0, Z1, Z0 all come directly from dataprep object.
#>
#>
#> ****************
#> searching for synthetic control unit
#>
#>
#> ****************
#> ****************
#> ****************
#>
#> MSPE (LOSS V): 0.005105562
#>
#> solution.v:
#> 0.03795838 0.02953412 0.03356642 0.01533716 0.1226315 0.1285906 0.05816525 0.02318678 0.01465216 0.01080646 0.06187415 0.0289542 0.01702719 0.08006876 0.009607601 0.01627082 0.1278952 0.02615566 0.01342692 0.04431671 0.04468165 0.01097563 0.04431671
#>
#> solution.w:
#> 1.1452e-05 0.0002666085 0.0001873182 0.0002686277 0.0001636778 0.0003347625 0.0004905744 0.0005939929 0.0005203545 0.5502766 4.8739e-06 0.0007708196 0.0003844661 0.001002508 0.000803214 0.000999687 0.1285913 0.3143298
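In the Synthetic Control method, the solution.w values printed above are the weights given to each candidate control unit, and the synthetic unit is the weighted average of those units. The sketch below shows that combination step only. The donor series are made-up numbers, the weights are the three largest of this run rounded to three decimals and renormalized, and none of the names come from the SynthCast API.

```r
# Hypothetical donor series: rows are time periods, columns are donor units.
donors <- cbind(
  unit_10 = c(0.40, 0.42, 0.45, 0.47),
  unit_17 = c(0.30, 0.35, 0.33, 0.36),
  unit_18 = c(0.50, 0.48, 0.52, 0.55)
)

# Illustrative weights: the three largest weights of this run, rounded,
# then renormalized so that they sum to exactly 1.
w <- c(unit_10 = 0.550, unit_17 = 0.129, unit_18 = 0.314)
w <- w / sum(w)

# The synthetic series is the weighted average of the donor columns.
synthetic <- as.vector(donors %*% w[colnames(donors)])
synthetic
```

Because the weights are non-negative and sum to 1, each synthetic value lies between the smallest and largest donor value for that period.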
The output of the function is a list with 4 tables: synthetic_control_composition, variable_importance_and_comparison, mape_backtest and output_projecao. Each one is described below.
synthetic_control_composition
This table summarizes the results related to the unit selection from the Synthetic Control method. The columns are the following:
kable(synthetic_forecast$synthetic_control_composition)
execution_date | projected_unit | projected_serie | synthetic_units | w.weights |
---|---|---|---|---|
2022-03-08 | 30 | x1 | 10 | 0.550 |
2022-03-08 | 30 | x1 | 18 | 0.314 |
2022-03-08 | 30 | x1 | 17 | 0.129 |
2022-03-08 | 30 | x1 | 8 | 0.001 |
2022-03-08 | 30 | x1 | 9 | 0.001 |
2022-03-08 | 30 | x1 | 12 | 0.001 |
2022-03-08 | 30 | x1 | 14 | 0.001 |
2022-03-08 | 30 | x1 | 15 | 0.001 |
2022-03-08 | 30 | x1 | 16 | 0.001 |
- execution_date: The date on which the forecast was executed, in YYYY-MM-DD format;
- projected_unit: The forecasted unit;
- projected_serie: The forecasted series;
- synthetic_units / w.weights: The units (from 1 to 18) selected and their respective weights.

variable_importance_and_comparison
This table summarizes the results related to the features/variables selection from the Synthetic Control method. The columns are the following:
execution_date | projected_unit | projected_serie | variable | unit_of_interest | synthetic | sample | v.weights |
---|---|---|---|---|---|---|---|
2022-03-08 | 30 | x1 | x8 | 0.474 | 0.550 | 0.567 | 0.129 |
2022-03-08 | 30 | x1 | x20 | 0.537 | 0.613 | 0.630 | 0.128 |
2022-03-08 | 30 | x1 | x7 | 0.110 | 0.099 | 0.088 | 0.123 |
2022-03-08 | 30 | x1 | x16 | 0.289 | 0.192 | 0.168 | 0.080 |
2022-03-08 | 30 | x1 | x13 | 0.237 | 0.145 | 0.116 | 0.062 |
2022-03-08 | 30 | x1 | x9 | 0.443 | 0.441 | 0.433 | 0.058 |
2022-03-08 | 30 | x1 | x25 | 0.517 | 0.370 | 0.317 | 0.045 |
2022-03-08 | 30 | x1 | x24 | 0.729 | 0.709 | 0.699 | 0.044 |
- execution_date: The date on which the forecast was executed, in YYYY-MM-DD format;
- projected_unit: The forecasted unit;
- projected_serie: The forecasted series;
- variable: The variable selected;
- unit_of_interest: The mean value over time of the variable in column variable for the unit in projected_unit;
- synthetic: The mean value over time of the variable in column variable for the synthetic unit;
- sample: The mean value over time of the variable in column variable over the whole dataset;
- v.weights: The weight of the variable in column variable.

mape_backtest
This table shows the results of a simple MAPE backtest on the period that was used to build the forecast. It is worth noting that the intention is not to provide a robust method for validating the model. The Synthetic Control method is a mathematical approach, not a machine learning one, and it minimizes the distance between curves without worrying about overfitting. The columns are the following:
kable(synthetic_forecast$mape_backtest)
execution_date | projected_unit | projected_serie | max_time_unit_of_interest | periods_to_forecast | elegible_control_units | number_control_units | mape |
---|---|---|---|---|---|---|---|
2022-03-08 | 30 | x1 | 21 | 12 | 17 | 9 | 13.00928 |
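For reference, a mean absolute percentage error can be computed as in the sketch below. The mape() helper and the input values are hypothetical, and the result is expressed in percent here; whether the package reports a percentage or a fraction is not stated in this document, so treat the scaling as an assumption.

```r
# mape() is an illustrative helper, NOT a SynthCast function.
# It returns the mean absolute percentage error, expressed in percent.
mape <- function(observed, predicted) {
  stopifnot(length(observed) == length(predicted), all(observed != 0))
  100 * mean(abs((observed - predicted) / observed))
}

# Hypothetical observed values and synthetic fit over a backtest window.
observed  <- c(0.40, 0.50, 0.25, 0.20)
predicted <- c(0.44, 0.45, 0.25, 0.22)

mape(observed, predicted)
```

Note that this measure is undefined when an observed value is zero, which is why the helper rejects such input.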
- execution_date: The date on which the forecast was executed, in YYYY-MM-DD format;
- projected_unit: The forecasted unit;
- projected_serie: The forecasted series;
- max_time_unit_of_interest: The age of the unit of interest;
- periods_to_forecast: The number of periods that were forecasted;
- elegible_control_units: The number of units eligible to be used in the forecast;
- mape: The mean absolute percentage error over the periods from 1 to max_time_unit_of_interest.

output_projecao
This table contains the projection itself. The columns are the following:
kable(synthetic_forecast$output_projecao)
execution_date | projected_unit | time_period | projected_serie_value | is_projected | projected_serie |
---|---|---|---|---|---|
2022-03-08 | 30 | 1 | 0.4354680 | 0 | x1 |
2022-03-08 | 30 | 2 | 0.4321821 | 0 | x1 |
2022-03-08 | 30 | 3 | 0.5256354 | 0 | x1 |
2022-03-08 | 30 | 4 | 0.4840789 | 0 | x1 |
2022-03-08 | 30 | 5 | 0.3801790 | 0 | x1 |
2022-03-08 | 30 | 6 | 0.2640425 | 0 | x1 |
2022-03-08 | 30 | 7 | 0.1495329 | 0 | x1 |
2022-03-08 | 30 | 8 | 0.2581808 | 0 | x1 |
2022-03-08 | 30 | 9 | 0.2937315 | 0 | x1 |
2022-03-08 | 30 | 10 | 0.3000216 | 0 | x1 |
2022-03-08 | 30 | 11 | 0.3381660 | 0 | x1 |
2022-03-08 | 30 | 12 | 0.3035805 | 0 | x1 |
2022-03-08 | 30 | 13 | 0.2989308 | 0 | x1 |
2022-03-08 | 30 | 14 | 0.6051545 | 0 | x1 |
2022-03-08 | 30 | 15 | 0.3462337 | 0 | x1 |
2022-03-08 | 30 | 16 | 0.3895760 | 0 | x1 |
2022-03-08 | 30 | 17 | 0.4199159 | 0 | x1 |
2022-03-08 | 30 | 18 | 0.4777851 | 0 | x1 |
2022-03-08 | 30 | 19 | 0.5354843 | 0 | x1 |
2022-03-08 | 30 | 20 | 0.4860005 | 0 | x1 |
2022-03-08 | 30 | 21 | 0.4963447 | 0 | x1 |
2022-03-08 | 30 | 22 | 0.4928737 | 1 | x1 |
2022-03-08 | 30 | 23 | 0.4534551 | 1 | x1 |
2022-03-08 | 30 | 24 | 0.4750725 | 1 | x1 |
2022-03-08 | 30 | 25 | 0.4928884 | 1 | x1 |
2022-03-08 | 30 | 26 | 0.7005200 | 1 | x1 |
2022-03-08 | 30 | 27 | 0.3911140 | 1 | x1 |
2022-03-08 | 30 | 28 | 0.4438282 | 1 | x1 |
2022-03-08 | 30 | 29 | 0.4673172 | 1 | x1 |
2022-03-08 | 30 | 30 | 0.4722184 | 1 | x1 |
2022-03-08 | 30 | 31 | 0.4898868 | 1 | x1 |
2022-03-08 | 30 | 32 | 0.5014260 | 1 | x1 |
2022-03-08 | 30 | 33 | 0.4313357 | 1 | x1 |
- execution_date: The date on which the forecast was executed, in YYYY-MM-DD format;
- projected_unit: The forecasted unit;
- time_period: The time period;
- projected_serie: The forecasted series;
- projected_serie_value: The value of the series/variable named in column projected_serie;
- is_projected: 1 indicates that the value is projected, 0 indicates that the value is observed.
proj <- synthetic_forecast$output_projecao
proj %>% glimpse()
#> Rows: 33
#> Columns: 6
#> $ execution_date <chr> "2022-03-08", "2022-03-08", "2022-03-08", "2022-…
#> $ projected_unit <chr> "30", "30", "30", "30", "30", "30", "30", "30", …
#> $ time_period <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 1…
#> $ projected_serie_value <dbl> 0.4354680, 0.4321821, 0.5256354, 0.4840789, 0.38…
#> $ is_projected <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ projected_serie <chr> "x1", "x1", "x1", "x1", "x1", "x1", "x1", "x1", …
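A common next step is to separate the observed history from the forecast using the is_projected flag. The sketch below does this on a small hand-built stand-in for output_projecao, using four rows copied from the table above, so that it runs without the package; with a real run, proj would simply be synthetic_forecast$output_projecao.

```r
# Hypothetical stand-in for synthetic_forecast$output_projecao,
# built from a few rows of the table shown earlier.
proj <- data.frame(
  projected_unit        = "30",
  time_period           = c(20L, 21L, 22L, 23L),
  projected_serie_value = c(0.4860005, 0.4963447, 0.4928737, 0.4534551),
  is_projected          = c(0, 0, 1, 1),
  projected_serie       = "x1"
)

# Split the rows by the is_projected flag.
observed <- proj[proj$is_projected == 0, ]
forecast <- proj[proj$is_projected == 1, ]

# First forecasted period and number of forecasted rows in this slice.
min(forecast$time_period)
nrow(forecast)
```

In the full table, the first forecasted period is 22, which is the period right after the last observed one (21).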