In this series of blog posts I will try to summarise the progress so far, and highlight ways that you can use it. This first post will summarise the papers so far and give an overview of the main findings. Later posts will focus on explaining how MTA works.

The key points behind MTA are the following:

- It is a radically different approach to time series modelling, recognising that the data sampling frequency may not be the best for a given modelling purpose.
- A time series is modelled simultaneously at multiple temporal aggregation levels that can be easily generated from the original data. At each level an appropriate model is fit, focusing on the components of the series that are strengthened by temporal aggregation.
- If forecasting is the objective, then the produced forecast reconciles the information from all these models. This makes the forecast robust to modelling uncertainty and lessens the importance of model selection.
- The resulting forecasts have been shown to be reliable and typically outperform the conventional modelling approach.
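As a rough illustration of the second point, the aggregation levels can be generated from the original data with base R alone. This is only a minimal sketch: the choice of `AirPassengers` and of the names `k.levels` and `y.agg` is mine, and it covers only the data preparation step, not the model fitting or the reconciliation done by MAPA or thief.

```r
# Non-overlapping temporal aggregation of a monthly series (base R only).
y <- AirPassengers                      # monthly data, frequency 12

# Aggregation levels k that divide the seasonal period exactly
k.levels <- c(1, 2, 3, 4, 6, 12)

# At each level, sum every k consecutive months into one bucket
y.agg <- lapply(k.levels, function(k) {
  aggregate(y, nfrequency = frequency(y) / k, FUN = sum)
})
names(y.agg) <- paste0("k", k.levels)

# Number of observations available at each aggregation level
n.obs <- sapply(y.agg, length)
print(n.obs)
```

Each element of `y.agg` is the same series seen at a different frequency, so a separate model can be fitted to each one.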

Table 1 summarises our contributions on MTA so far (follow the links to access the papers). We have also released two R packages that implement MTA: MAPA and thief. The former implements, as the name suggests, MAPA, while the latter provides code to use Temporal Hierarchies.

Paper | Summary |
---|---|
Kourentzes et al. 2014. Improving forecasting by estimating time series structural components across multiple frequencies. | The initial paper on MTA modelling. It introduces the Multiple Aggregation Prediction Algorithm (MAPA) and demonstrates its superior performance on the well-known M3 competition. |
Petropoulos and Kourentzes 2014. Forecast combinations for intermittent demand. | Expands MAPA for the case of intermittent demand. |
Kourentzes and Petropoulos 2016. Forecasting with multivariate temporal aggregation: The case of promotional modelling. | Expands MAPA for promotional modelling purposes at Stock Keeping Unit level. |
Barrow and Kourentzes 2016. Distributions of forecasting errors of forecast combinations: implications for inventory management. | Provides evidence of very strong performance of MAPA over established benchmarks for demand forecasting and inventory management purposes. |
Athanasopoulos et al. 2017. Forecasting with temporal hierarchies. | Introduces a general framework for MTA: Temporal Hierarchies that allows use of any model/method to produce forecasts at each level. |
Kourentzes et al. 2017. Demand forecasting by temporal aggregation: using optimal or multiple aggregation levels? | Demonstrates that MTA modelling is more robust to uncertainty than modelling either using the original data or using a single (optimal) temporal aggregation level. |

To give you an idea of the reported improvements, I have collated some of the results from the papers above. The best forecast in each column, in all tables, is highlighted in boldface. Table 2 provides a summary for the quarterly and monthly M3 datasets, using as benchmarks the Exponential Smoothing (ETS) family of models, with automatic model selection (via AICc), and Theta, the best performing method on the original M3 competition – a position it held for almost 15 years! In this case both MAPA and Temporal Hierarchies make use of the ETS family of models, so you can get a feeling of the improvement provided by MTA over conventional time series forecasting, as the results are directly comparable with the ETS row.

Tables 3 and 4 provide results for a number of real datasets. Table 4 also provides results on a variety of simulated ARIMA series. The detailed results can be found in the respective papers. In all cases MAPA is better, or at least as good, compared to the various benchmarks. Table 5 provides results on real series that have promoted periods. There are two comparisons: forecasts without and with promotional information. In both cases MTA based forecasts (MAPA) are on average the most accurate.

Forecast | Quarterly set | Monthly set |
---|---|---|
Exponential Smoothing (ETS) | 9.94% | 14.45% |
Theta (M3 competition) | **8.96%** | 13.85% |
MAPA (Kourentzes et al. 2014) | 9.58% | 13.69% |
Temporal Hierarchies (Athanasopoulos et al. 2017) | 9.70% | **13.61%** |

Forecast | 1-step ahead | 3-steps ahead | 5-steps ahead |
---|---|---|---|
Naive | 0.882 | 0.900 | 0.919 |
ETS | 0.677 | 0.688 | 0.711 |
AR | 0.707 | 0.719 | 0.737 |
ARIMA | 1.446 | 0.701 | 0.721 |
Theta | 0.674 | 0.685 | 0.705 |
MAPA | **0.668** | **0.670** | **0.687** |

Forecast | Simulated ARIMA | Manufacturing | Call centre |
---|---|---|---|
Single Exponential Smoothing (SES) | 1.000 | 1.000 | 1.000 |
Exponential Smoothing (ETS) | 0.985 | 1.011 | 1.005 |
Optimal Temporal Aggregation & SES | 0.974 | 0.999 | 1.080 |
MAPA | **0.971** | **0.994** | **0.979** |

Forecast | 4-steps ahead | 8-steps ahead | 12-steps ahead |
---|---|---|---|
Naive | 0.743 | 0.818 | 0.704 |
ETS | 0.704 | 0.774 | 0.701 |
MAPA | 0.679 | 0.754 | 0.736 |
Regression + Promotional | 0.611 | 0.659 | 0.714 |
ETS + Promotional | 0.642 | 0.627 | 0.543 |
MAPA + Promotional | **0.525** | **0.521** | **0.515** |

The main argument in all papers is that MTA helps to improve forecast accuracy due to the way it mitigates modelling uncertainty. As we will see this comes at no additional data cost and relatively limited additional computations. An added benefit, which is not very evident from the summarised tables provided here, is that the MTA forecasts are reliable both for short and long term forecasting, providing a way to reconcile operational, tactical and strategic planning.

Unpublished results on different applications provide a similar picture in terms of accuracy. There is also evidence that MTA can strengthen statistical tests, as the initial results of this experiment show. However, all this is ongoing research, so until a full analysis is conducted and the results are peer reviewed, I would add a pinch of salt to these!

In following blog posts I will explain how MTA works and elaborate more on results from the various papers.

Recent advances have demonstrated the benefits of temporal aggregation for demand forecasting, including increased accuracy, improved stock control and reduced modelling uncertainty. With temporal aggregation a series is transformed, strengthening or attenuating different elements and thereby enabling better identification of the time series structure. Two different schools of thought have emerged. The first focuses on identifying a single optimal temporal aggregation level at which a forecasting model maximises its accuracy. In contrast, the second approach fits multiple models at multiple levels, each capable of capturing different features of the data. Both approaches have their merits, but so far they have been investigated in isolation. We compare and contrast them from a theoretical and an empirical perspective, discussing the merits of each, comparing the realised accuracy gains under different experimental setups, as well as the implications for business practice. We provide suggestions when to use each for maximising demand forecasting gains.

Download paper.

R package (MAPA). Code for optimal aggregation level: function `get.opt.k` in TStools package.

To understand more about this, I set up a simple experiment to collect evidence on how humans perceive trends. The experiment below asks you to distinguish between trended and non-trended time series. **Every 10 time series** that you assess, it will provide you with some statistics on your accuracy and the accuracy of some statistical tests (by no means an exhaustive list!). It also provides overall statistics from all participants so far. As you can see, it is not so trivial to correctly identify the presence of trend! What do you think, can you do better than the average performance so far?

This paper introduces the concept of Temporal Hierarchies for time series forecasting. A temporal hierarchy can be constructed for any time series by means of non-overlapping temporal aggregation. Predictions constructed at all aggregation levels are combined with the proposed framework to result in temporally reconciled, accurate and robust forecasts. The implied combination mitigates modelling uncertainty, while the reconciled nature of the forecasts results in a unified prediction that supports aligned decisions at different planning horizons: from short-term operational up to long-term strategic planning. The proposed methodology is independent of forecasting models. It can embed high level managerial forecasts that incorporate complex and unstructured information with lower level statistical forecasts. Our results show that forecasting with temporal hierarchies increases accuracy over conventional forecasting, particularly under increased modelling uncertainty. We discuss organisational implications of the temporally reconciled forecasts using a case study of Accident & Emergency departments.

Download paper.

R package (thief).

`nnetar` function in the `forecast` package, written by Rob Hyndman. In my view there is space for a more flexible implementation, so I decided to write a few functions for that purpose. For now these are included in the TStools package that is available on GitHub, but when I am happy with their performance and flexibility I will put them in a package of their own.

Here I will provide a quick overview of what is available right now. I plan to write a more detailed post about these functions when I get the time.

For this example I will model the `AirPassengers` time series available in R. I have kept the last 24 observations as a test set and will use the rest to fit the neural networks. Currently there are two types of neural network available, both feed-forward: (i) multilayer perceptrons (use function `mlp`); and (ii) extreme learning machines (use function `elm`).
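The split described above could be set up along the following lines. The names `y.in` and `tst.n` are the ones used in the code later in this post; `y.out` is a name I am assuming here for the held-out test set.

```r
# Hold out the last 24 monthly observations of AirPassengers as a test set
tst.n <- 24

# AirPassengers runs from January 1949 to December 1960,
# so the last 24 observations are the years 1959 and 1960
y.in  <- window(AirPassengers, end = c(1958, 12))    # training set
y.out <- window(AirPassengers, start = c(1959, 1))   # test set

print(c(train = length(y.in), test = length(y.out)))
```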

```
# Fit MLP (mlp() comes from TStools; y.in holds the training observations)
library(TStools)
mlp.fit <- mlp(y.in)
plot(mlp.fit)
print(mlp.fit)
```

This is the basic command to fit an MLP network to a time series. It will attempt to automatically specify the autoregressive inputs and any necessary pre-processing of the time series. With the default arguments it trains 20 networks, each with a single hidden layer of 5 nodes, which are used to produce an ensemble forecast. You can override any of these settings. The output of `print` is a summary of the fitted network:

MLP fit with 5 hidden nodes and 20 repetitions. Series modelled in differences: D1. Univariate lags: (1,3,4,6,7,8,9,10,12) Deterministic seasonal dummies included. Forecast combined using the median operator. MSE: 6.2011.

As you can see, the function determined that level differences are needed to capture the trend. It also selected some autoregressive lags and decided to also use dummy variables for the seasonality. Using `plot` displays the architecture of the network (Fig. 1).

The light red inputs represent the binary dummies used to code seasonality, while the grey ones are autoregressive lags. To produce forecasts you can type:

    mlp.frc <- forecast(mlp.fit,h=tst.n)
    plot(mlp.frc)

Fig. 2 shows the ensemble forecast, together with the forecasts of the individual neural networks. You can control the way that forecasts are combined (I recommend using the median or mode operators), as well as the size of the ensemble.
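To see why the median is a sensible combination operator, consider the toy sketch below. It does not use the `mlp` function at all: the forecasts are made-up numbers for three hypothetical ensemble members, one of which is poorly trained.

```r
# Toy ensemble: rows are forecast horizons, columns are ensemble members.
# All numbers are invented purely for illustration.
frc.members <- cbind(net1 = c(100, 110, 120),
                     net2 = c(102, 108, 125),
                     net3 = c(150, 160, 170))   # a poorly trained member

# Combine the members at each horizon
frc.median <- apply(frc.members, 1, median)
frc.mean   <- rowMeans(frc.members)

print(rbind(median = frc.median, mean = frc.mean))
```

The mean is pulled towards the poor member at every horizon, while the median stays with the majority of the ensemble.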

You can also let it choose the number of hidden nodes. There are various options for that, but all are computationally expensive (I plan to move the base code to CUDA at some point, so that computational cost stops being an issue).

```
# Fit MLP with automatic hidden layer specification
mlp2.fit <- mlp(y.in,hd.auto.type="valid",hd.max=10)
print(round(mlp2.fit$MSEH,4))
```

This will evaluate from 1 up to 10 hidden nodes and pick the best on validation set MSE. You can also use cross-validation (if you have patience…). You can ask it to output the errors for each size:

Hidden nodes | MSE |
---|---|
H.1 | 0.0083 |
H.2 | 0.0066 |
H.3 | 0.0065 |
H.4 | 0.0066 |
H.5 | 0.0071 |
H.6 | 0.0074 |
H.7 | 0.0061 |
H.8 | 0.0076 |
H.9 | 0.0083 |
H.10 | 0.0076 |
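Reading the selected size off these errors is a one-liner. The sketch below simply re-enters the validation MSE values reported above into a named vector (`mse.h` is my own name) and is not a view into how `mlp` does the selection internally.

```r
# Validation MSE for 1 to 10 hidden nodes, as reported above
mse.h <- c(H.1 = 0.0083, H.2 = 0.0066, H.3 = 0.0065, H.4 = 0.0066,
           H.5 = 0.0071, H.6 = 0.0074, H.7 = 0.0061, H.8 = 0.0076,
           H.9 = 0.0083, H.10 = 0.0076)

# The preferred size is the one with the smallest validation error
best.h <- names(which.min(mse.h))
print(best.h)
```

Note how flat the errors are between 2 and 7 nodes: the differences are small, which is one reason why ensembles of networks are helpful.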

There are a few experimental options for specifying various aspects of the neural networks, which are not fully documented, and it is probably best if you stay away from them for now!

ELMs work in pretty much the same way, although for these I have made the automatic specification of the hidden layer the default.

```
# Fit ELM
elm.fit <- elm(y.in)
print(elm.fit)
plot(elm.fit)
```

This gives the following network summary:

ELM fit with 100 hidden nodes and 20 repetitions. Series modelled in differences: D1. Univariate lags: (1,3,4,6,7,8,9,10,12) Deterministic seasonal dummies included. Forecast combined using the median operator. Output weight estimation using: lasso. MSE: 83.0044.

I appreciate that using 100 hidden nodes on such a short time series can make some people uneasy, but I am using a shrinkage estimator instead of conventional least squares to estimate the weights, which in fact eliminates most of the connections. This is apparent in the network architecture in Fig. 3. Only the nodes connected with the black lines to the output layer contribute to the forecasts. The remaining connection weights have been shrunk to zero.

Another nice thing about these functions is that you can call them from the thief package, which implements Temporal Hierarchies forecasting in R. You can do that in the following way:

```
# Use THieF
library(thief)
# Store the result under a new name so that the mlp.thief function is not masked
thief.mlp.frc <- thief(y.in,h=tst.n,forecastfunction=mlp.thief)
```

There is a similar function for using ELM networks: `elm.thief`.

Since for this simple example I kept some test set, I benchmark the forecasts against exponential smoothing:

Method | MAE |
---|---|
MLP (5 nodes) | 62.471 |
MLP (auto) | 48.234 |
ELM | 48.253 |
THieF-MLP | 45.906 |
ETS | 64.528 |
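For reference, the Mean Absolute Error (MAE) on a test set can be computed as in the sketch below. The helper `mae` and the naive benchmark are my own additions for illustration, they are not part of TStools and the naive forecast does not appear in the table above.

```r
# Mean Absolute Error between actuals and forecasts
mae <- function(actual, forecast) {
  mean(abs(as.numeric(actual) - as.numeric(forecast)))
}

# Same split as in the example: last 24 observations kept for testing
y.in  <- window(AirPassengers, end = c(1958, 12))
y.out <- window(AirPassengers, start = c(1959, 1))

# Illustrative benchmark: repeat the last in-sample value over the test set
naive.frc <- rep(tail(as.numeric(y.in), 1), length(y.out))
naive.mae <- mae(y.out, naive.frc)
print(naive.mae)
```

The same `mae` call can be applied to the mean forecast of any of the models above, as long as the forecast horizon matches the length of the test set.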

Temporal hierarchies, like MAPA, are great for making your forecasts more robust and often more accurate. However, with neural networks the additional computational cost is evident!

These functions are still in development, so the default values may change and there are a few experimental options that may give you good results or not!

We propose a forecasting method to improve accuracy for tactical sales predictions at a major supplier to the tire industry. This level of forecasting serves as direct input for the demand planning, steering the global supply chain and is typically up to a year ahead. The case company has a product portfolio that is strongly sensitive to external events. Univariate statistical methods, which are common in practice, are unable to anticipate and forecast changes in the market, while human expert forecasts are known to be biased and inconsistent. The proposed method is able to automatically identify key leading indicators that drive sales from a massive set of macro-economic indicators, across different regions and markets and produce accurate forecasts. Our method is able to handle the additional complexity of the short and long term dynamics from the product sales and the external indicators. We find that accuracy is improved by 16.1% over current practice with proportional benefits for the supply chain. Furthermore, our method provides transparency to the market dynamics, allowing managers to better understand the events and economic variables that affect the sales of their products.

Download paper.

**1. Multilayer Perceptron (MLP) neural networks**

MLPs are a basic form of neural networks. Having a good understanding of these can help one understand most types of neural networks, as typically other types are constructed by adding more connections (such as feedbacks or skip-layer/direct connections). Let us assume that we have three different inputs, (X_{1}, X_{2}, X_{3}), which could be different variables or lags of the target variable. An MLP with a single hidden layer with 5 hidden nodes and a single output node can be visualised as in Fig. 1.

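An input (for example *X_{1}*) is passed and processed through all 5 hidden nodes, the results of which are combined in the output node:

$$Y = b_{0} + \sum_{i=1}^{5} b_{i} \, f\left(a_{0,i} + \sum_{j=1}^{3} a_{j,i} X_{j}\right), \quad (1)$$

where *b_{0}* and *a_{0,i}* are constants, *b_{i}* are the weights of the output node and *a_{j,i}* are the weights of input *X_{j}* in hidden node *i*.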

The transfer function *f()* is typically either the logistic sigmoid or the hyperbolic tangent for regression problems. The output node typically uses a linear transfer function, acting as a conventional linear regression. To really understand how the input values are transformed to the network output, we need to understand how a single neuron functions.

**2. Neurons**

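Consider a neuron as a nonlinear regression of the form (for the example with 3 inputs):

$$H_{i} = f\left(a_{0,i} + \sum_{j=1}^{3} a_{j,i} X_{j}\right). \quad (2)$$

If *f()* is the identity function, then (2) becomes a conventional linear regression. If *f()* is nonlinear then the magic starts! Depending on the values of the weights *a_{j,i}* and the constant *a_{0,i}*, the behaviour of the neuron changes substantially. For a neuron with a single input this behaviour is determined by: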

- the type of transfer function;
- the values of the input, weight and constant.

The first plot shows the input-output values, the plot of the transfer function and with cyan background the area of values that can be considered by the neuron given selected weight and constant. The second plot provides a view of the neuron function, given the transfer function, weight and constant. Observe that the weight controls the width of the neuron and the constant the location, along the transfer function.

What is quite important to note here is that both logistic sigmoid and hyperbolic tangent squash the input between two values and the output cannot increase or decrease indefinitely, as with the linear. Also the combination of weight and constant can result in different forms of nonlinearities or approximate linear behaviours. As a side note, although I do not see MLP as anything to do with simulating biological networks, the sigmoid-type transfer functions are partly inspired by the stimulated or not states of biological neurons.

By now two things should become evident:

- The scale of the inputs is very important for neural networks, as very large or small values result in the same constant output, essentially acting at the bounds of the neuron plots above. Although in theory it is possible to achieve the desired scaling using only appropriate weights and constants, training of networks is aided tremendously by scaling the inputs to a reasonable range, often close to [-1,1].
- With sigmoid type transfer functions it is impossible to achieve an ever increasing/decreasing range of outputs. So for example if we were to use as an input a vector (1, 2, 3, 4, 5, 6, …, n) the output would be squashed between [0, 1] or [-1, 1] depending on the transfer function, irrespective of how large *n* is.
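Both points are easy to verify numerically. The sketch below evaluates a single neuron with one input for an ever increasing input vector; the weight and constant values are arbitrary choices of mine.

```r
# Logistic sigmoid transfer function (tanh is built into R)
logistic <- function(x) 1 / (1 + exp(-x))

# A single-input neuron: transfer function applied to constant + weight * input
neuron <- function(x, w, a0, f = logistic) f(a0 + w * x)

# An ever increasing input
x <- 1:1000

out.logistic <- neuron(x, w = 0.5, a0 = -2)
out.tanh     <- neuron(x, w = 0.5, a0 = -2, f = tanh)

# The outputs never leave the bounds of the transfer function
print(range(out.logistic))
print(range(out.tanh))
```

Beyond the first few values of the input both neurons return practically the same constant, which is why unscaled inputs are so uninformative for the network.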

Of course, as Eq. (1) suggests, in a neural network the output of a neuron is multiplied by a weight and shifted by a constant, so it is relatively easy to achieve output values much greater than the bounds of a single neuron. Nonetheless, a network will still “saturate” and reach a minimum/maximum value and cannot decrease/increase perpetually, unless non-squashing neurons are used as well (this is for example a case where direct connections to a linear output become useful). An example of this follows.

Suppose we want to predict the future values of a deterministic upward trend with no noise, of the form: y_{t} = x_{t} and x_{t} = (1, 2, 3, 4, …). We scale the observed values between [-1, 1] to facilitate the training of the neural network. We use only 80% of the values for training the network and the remaining 20% to test the performance of the forecasts (test set A). We train a network with 5 logistic hidden nodes and a single linear output. Fig. 2 provides the resulting network with the trained weights and constants.

The single input (scaled x_{t}) is fed to all five nodes. Observe that it is multiplied with different weights (black numbers) and shifted by different constants (blue numbers) at each node. When additional inputs are used, the inherent difficulty in interpreting all these weights together is what leads neural networks to be considered black boxes. Fig. 3 provides the observed y_{t} and predicted neural network values. The network is able to provide a very good fit in the training set and for most of test set A, but as the values increase further (test set B) we can see that the network starts to saturate (the individual nodes, and eventually the whole network, reach the upper bound of the values they can output) and the predicted trend tapers off. As we saw earlier, each sigmoid-type node has a maximum value it can output.
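The saturation can be reproduced without training anything. The sketch below builds a network with one input, 5 logistic hidden nodes and a linear output, using weights and constants that I have made up (they are not the trained values of Fig. 2), and compares its output with the largest value it could ever produce.

```r
logistic <- function(x) 1 / (1 + exp(-x))

# Hypothetical weights and constants, chosen only for illustration
a0 <- c(-1.0, 0.5, -0.3, 1.2, -2.0)   # hidden node constants
a1 <- c( 2.0, 1.5,  0.8, 3.0,  1.0)   # hidden node weights on the input
b0 <- -1.5                            # output constant
b  <- c( 0.6, 0.4,  0.7, 0.5,  0.8)   # output weights

# Network output: linear combination of the squashed hidden node outputs
net <- function(x) {
  sapply(x, function(xi) b0 + sum(b * logistic(a0 + a1 * xi)))
}

# Inputs inside and well beyond the scaled training range [-1, 1]
x <- seq(-1, 5, by = 0.5)
y.hat <- net(x)

# Each logistic node outputs at most 1, so the network cannot exceed this
upper <- b0 + sum(pmax(b, 0))

print(round(y.hat, 3))
print(upper)
```

The output keeps increasing with the input, but by ever smaller steps, and it can never exceed `upper`, however large the input becomes.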

This raises a significant doubt whether neural networks can forecast trended time series, if they are unable to model such an easy case. One would argue that with careful scaling of data (see good fit in test set A) it is possible to predict trends, but that implies that one knows the range that the future values would be in, to accommodate them with appropriate scaling. This information is typically unknown, especially when the trend is stochastic in nature.
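The scaling problem can be sketched as follows. The helper names `scale.minmax` and `unscale.minmax`, and the choice of 100 observed and 20 future values, are assumptions of mine for illustration.

```r
# Linear scaling to [-1, 1], given the minimum and maximum used for scaling
scale.minmax <- function(x, lo, hi) 2 * (x - lo) / (hi - lo) - 1

# Reverse the scaling
unscale.minmax <- function(z, lo, hi) (z + 1) / 2 * (hi - lo) + lo

# Observed values of the deterministic trend and values not yet seen
x.obs    <- 1:100
x.future <- 101:120

# The scaling parameters can only come from what has been observed
lo <- min(x.obs)
hi <- max(x.obs)

z.obs    <- scale.minmax(x.obs, lo, hi)
z.future <- scale.minmax(x.future, lo, hi)

print(range(z.obs))      # within [-1, 1] by construction
print(range(z.future))   # outside the range the network was trained on
```

Any value beyond the observed maximum is mapped above 1, which is the region where the sigmoid-type nodes have already saturated.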

**3. Forecasting trends**

Although forecasting trends is problematic when using raw data, we can pre-process the time series to enable successful modelling. We can remove any trend through differencing. Much like with ARIMA modelling, we overcome the problem of requiring the network to provide ever increasing/decreasing values and therefore we can model such series. For example, considering one of the yearly M3 competition series we can produce the following forecast:

Fig. 4 provides the actuals and forecasts. Differencing and scaling are applied first, then the forecast is produced, and subsequently the differencing and scaling are reversed. However, there are some limitations to consider:

- This approach implies a two stage model, where first *z_{t} = y_{t} – y_{t-1}* is constructed and then *z_{t}* is modelled using neural networks. This imposes a set of modelling constraints that may be inappropriate.
- The neural network is capable of capturing nonlinearities. However if such nonlinearities are connected to the level, for example multiplicative seasonality, then by using differences we are making it very difficult for the network to approximate the underlying functional form.
- Differencing implies stochastic trend, which in principle is inappropriate when dealing with deterministic trend.
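The mechanics of the two stage approach are simple. In the sketch below both the short series and the forecasts of the differences are made-up numbers, standing in for the output of whatever model is fitted to *z_{t}*.

```r
# A short trended series (invented values)
y <- c(10, 12, 15, 19, 24, 30)

# Stage 1: construct the differenced series z_t = y_t - y_{t-1}
z <- diff(y)

# Differencing is reversible given the first observation
y.back <- y[1] + c(0, cumsum(z))

# Stage 2: model z_t. Here we only assume some forecasts of the differences
z.frc <- c(7, 8, 9)

# Reverse the differencing: accumulate from the last observed level
y.frc <- tail(y, 1) + cumsum(z.frc)
print(y.frc)
```

The network only ever has to output values in the range of the differences, while the accumulation step is what allows the final forecast to keep increasing.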

Therefore, it is fair to say that differencing is useful, but is by no means the only way to deal with trends and surely not always the best option. However, it is useful to understand how sigmoid-type neurons and networks are bound to fail in modelling raw trended time series. There have been several innovations in neural networks for forecasting, but most are bound by this limitation due to the transfer functions used.

So, can neural networks forecast trended time series? Fig. 4 suggests yes, but how to best do it is still an open question. Past research that I have been part of has shown that using differences is reliable and effective (for example see the specifications of neural networks here and here), even though there are unresolved problems with differencing. Surely just expecting the network to “learn” to forecast trends is not enough.

A typical premise of forecasting model building is that a good fit in the past should lead to a reasonable forecast in the future. Implicitly we assume that there is some underlying data generating process, which we hopefully approximate well enough, to forecast it in the future. **I do not believe that we can identify the true process for real data** – I have come to doubt even if a true process exists, though this is a philosophical point. Here I will not discuss how to avoid overfitting (for a related discussion see this post). Instead I will provide a slightly unusual example that I built for my students to exemplify the difference between in-sample fitting and forecasting.

For this example, I will not use a business time series, but music. We can extract from a song its waveform and analyse it as a time series, and even try to forecast it. There is a catch though: there is an inherent causality in music. It typically follows some logical structure, harmony and so on that a standard forecasting model is unaware of. For example, ARIMA or exponential smoothing will attempt to find structure and repeating patterns in the past and extrapolate them in the future, unaware of any linguistic or musical constructs.

Let us take a song, sample its first 10 seconds at 11,025 observations per second, and fit an adequate ARIMA. Using standard unit root testing and AICc we identify an ARIMA(5,0,4), that is an ARMA(5,4), as the best model. Note that the exact model used does not make much difference for this example, and an ARIMA of different order would lead to more or less the same conclusions.

Using this model we forecast the next 5 seconds. We can assess the quality of the model fit and the forecast visually, but also as sound. Does the model fit sound anything like the original song? Does the forecast sound anything like the original song? This is a good point to warn you that I have chosen a sufficiently interesting song.

Fig. 1 provides a plot of the 1000 last historical observations and the ARMA fit, as well as the forecast for the next 500 observations, where we can easily evaluate the quality of the model fit.

We can observe that the model fit is relatively good, yet the forecast is very weak. In fact, the forecast quickly reverts to the mean and the prediction intervals “contain” all the song. This already makes it quite clear that a good fit does not mean a good forecast. The difference between the quality of the in-sample fit and the out-of-sample forecast is striking when “heard”.
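The reversion to the mean is a general property of stationary ARMA forecasts and is easy to reproduce. The sketch below does not use the song: it simulates a stationary AR(2) series (the coefficients, seed and sample size are arbitrary choices of mine) and fits it with the `arima` function from base R.

```r
set.seed(42)

# Simulate a stationary AR(2) series around a mean of 10
x <- arima.sim(model = list(ar = c(0.6, -0.3)), n = 500) + 10

# Fit the same structure and forecast far ahead
fit <- arima(x, order = c(2, 0, 0))
frc <- predict(fit, n.ahead = 200)$pred

# In arima() the "intercept" coefficient is the estimated mean of the series
mu.hat <- coef(fit)["intercept"]

# Distance of the forecast from the estimated mean, by horizon
gap <- abs(as.numeric(frc) - mu.hat)
print(signif(gap[c(1, 5, 10, 50, 200)], 3))
```

After a handful of steps the forecast carries no information beyond the mean of the series, irrespective of how closely the model follows the data in-sample.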

The original sample:

The model fit:

The two sound samples are sufficiently similar. In fact, one can hear that the model fit is missing some of the higher pitches, which can be also seen in Fig. 1. Nonetheless, the fit sound sample is arguably retaining enough of the original song to be recognisable.

The forecast sounds like:

As seen in Fig. 1, the forecast very quickly becomes a flat line, which results in silence.

It is clear that although the model fit adequately captures the song, the forecast is useless. Why is this the case? The modelling methodology I followed here is geared towards good in-sample fit, rather than forecasting. Arguably the model family I used is also inappropriate. However, neither of these becomes apparent by consulting only the in-sample fit.

Generalising, consulting conventional in-sample statistics such as R^{2}, mean squared error, etc. is not only useless, but also dangerous, when the model building purpose is forecasting. Consulting penalised statistics, such as information criteria (AIC, BIC, etc.) is useful, but again care should be taken. In our example the best model was chosen using AICc and it still did not produce any useful forecast. Business forecasting is not that dissimilar from the example here.

The reason that forecasting fails here is that there was no consideration during model fitting to identify the key drivers of the time series, in this case the musical and linguistic structure (though there are no lyrics in the sample used here). Instead, a generic extrapolative model was used. Business series often have similar complexities, such as promotions, price changes, events, and so on. These should be part of the model building exercise. Simply flagging observations as outlying, or merely identifying and fitting the “best” model from a limited pool of alternatives is a dangerous strategy. What I often argue to my students is that a good modeller should appreciate the problem context and use it! In-sample statistics are woefully inadequate in doing this, especially when forecasting is the objective.

A final note: all rights for the song are with the original holders. I am in no way a composer myself and the song is not my work! Can anyone recognise which song it is?

- temporal aggregation and temporal hierarchies;
- promotional modelling at SKU level;
- use of online search data for lifecycle modelling.

You can download the slides here.
