Author Archives: Nikos

Multiple temporal aggregation: the story so far. Part II

The effects of temporal aggregation

In this post I will demonstrate the effects of temporal aggregation and motivate the use of multiple temporal aggregation (MTA). I will not delve into the econometric aspects of the discussion, but it is worthwhile to summarise key findings from the literature. A concise forecasting related summary is available in our recent paper Athanasopoulos et al. (2017), section 2:

  • Temporal aggregation changes the (identifiable) structure of the time series;
  • As the aggregation level increases there are less components that appear and higher-frequency components (for example, seasonality and promotions) become weaker or vanish altogether;
  • Temporal aggregation reduces the sample size resulting in loss of estimation efficiency. To make this simple, if you have 4 years of monthly data and you aggregate your series to a yearly level you will have to build a model with only for four data point, risky!
  • There are accuracy gains to be had, but identifying the (single) appropriate temporal aggregation level is very difficult! Yet, it still simplifies some problems, like intermittent demand forecasting.

What do these mean for our forecasts? Well, if you work on the basis that the true model is an elusive idea, these are not too prescriptive for constructing forecasts. I will try to give you an intuition visually. In the following interactive visualisation you can choose between different time series and plot the original and temporally aggregated data, together with a seasonal plot. The seasonal plot will be shown only when it is feasible, i.e. the resulting seasonality after the aggregation has an integer period greater than 1. For each series I also fit an appropriate exponential smoothing model (selected using AICc) and provide a list of the fitted components for all temporal aggregation levels, up to yearly data. I also provide the relevant forecast. Observe a few things:

  • The identified exponential smoothing models are often different across temporal aggregation levels. In particular the seasonality is filtered as we aggregate into bigger time buckets. Of course, for some aggregation levels (for example, aggregate every 5-months) the resulting series has a non-integer seasonal period and typical forecasting methods cannot capture it and instead it contributes to the error part;
  • Other aspects of the series, like outliers, vanish as we aggregate to higher levels;
  • Some times aggregation makes the time series easier to model, and sometimes it over-smooths the series! The forecasts surely vary a lot as we aggregate.

At minimum we can say that temporal aggregation alters the identifiable parts of the time series, strengthening low-frequency components (such as trend), while weakening high-frequency components (such as seasonality). Depending on the forecasting objective, this may result in better forecasts, especially if we are aiming at long term forecasts. Furthermore, simply because the temporal aggregation filters part of the noise (it is a moving average filter!) it may just be better to model a series at a more aggregate level.

The main problem in the literature is that it is very difficult to know what is the optimal temporal aggregation level, which will maximise your forecast accuracy, for real data. This is not a trivial point: There are theoretical solutions suggesting the optimal temporal aggregation level for various data generation processes, but they rely on full knowledge of the process! Well, if I knew the process, then forecasting it would be trivial. Recent research showed that although we can easily show benefits on simulated data, it becomes much more complicated with real data that the true model is unknown.

If we connect the dots, there are four key arguments in favour of MTA (discussed in more detail in these three papers [1], [2] and [3]):

  • Because we are provided with a time series sampled at some time interval, we do not have to model it at that level! It may be better to do so at some aggregate level.
  • Temporal aggregation can be beneficial for forecasting, but identifying a single optimal level of aggregation is very challenging, so why not use multiple?
  • Using multiple levels we avoid relying on a single forecasting model, therefore we mitigate modelling uncertainty by considering multiple (different) models across temporal aggregation levels.
  • Holistic modelling of the time series information: Models built on the original data or on low aggregation levels can focus more on high frequency components, while models build on high aggregation levels focus on low frequency components, which may not be easy to capture in the originally sampled time series.

All these points suggest that using Multiple Temporal Aggregation levels should be useful, but we have not yet addressed the question how to do this! I will introduce our first attempt to do this, the Multiple Temporal Aggregation Algorithm (MAPA) in the next post in the series.

Multiple Temporal Aggregation: the story so far: Part I; Part II.


Multiple temporal aggregation: the story so far. Part I

Over the last years I have been working (with my co-authors!) on the idea of Multiple Temporal Aggregation (MTA) for time series forecasting. A number of papers have been published introducing and developing the idea further, or testing its effectiveness for forecasting.

In this series of blog posts I will try to summarise the progress so far, and highlight ways that you can use it. This first post will summarise the papers so far and give an overview of the main findings. Later posts will focus on explaining how MTA works.

The key points behind MTA are the following:

  • It is a radically different approach to time series modelling, recognising that the data sampling frequency may not be the best for a given modelling purpose.
  • A time series is modelled simultaneously at multiple temporal aggregation levels that can be easily generated from the original data. At each level an appropriate model is fit, focusing on the components of the series that are strengthened by temporal aggregation.
  • If forecasting is the objective, then the produced forecast reconciles the information from all these models. This makes the forecast robust to modelling uncertainty and lessens the importance of model selection.
  • The resulting forecasts have been shown to be reliable and typically outperform the conventional modelling approach.

Table 1 summarises our contributions on MTA so far (follow the links to access the papers). We have also released two R packages that implement MTA: MAPA and thief. The former implements, as the name suggests, MAPA, while the latter provides code to use Temporal Hierarchies.

Table 1. Papers on MTA
Paper Summary
Kourentzes et al. 2014. Improving forecasting by estimating time series structural components across multiple frequencies. The initial paper on MTA modelling. It introduces the Multiple Aggregation Prediction Algorithm (MAPA) and demonstrates its superior performance on the well-known M3 competition.
Petropoulos and Kourentzes 2014. Forecast combinations for intermittent demand. Expands MAPA for the case of intermittent demand.
Kourentzes and Petropoulos 2016. Forecasting with multivariate temporal aggregation: The case of promotional modelling. Expands MAPA for promotional modelling purposes at Stock Keeping Unit level.
Barrow and Kourentzes 2016. Distributions of forecasting errors of forecast combinations: implications for inventory management. Provides evidence of very strong performance of MAPA over established benchmarks for demand forecasting and inventory management purposes.
Athanasopoulos et al. 2017. Forecasting with temporal hierarchies. Introduces a general framework for MTA: Temporal Hierarchies that allows use of any model/method to produce forecasts at each level.
Kourentzes et al. 2017. Demand forecasting by temporal aggregation: using optimal or multiple aggregation levels? Demonstrates that MTA modelling is more robust to uncertainty than modelling either using the original data or using a single (optimal) temporal aggregation level.

To give you an idea of the reported improvements, I have collated some of the results from the papers above. The best forecast in each column, in all tables, is highlighted in boldface. Table 2 provides a summary for the quarterly and monthly M3 datasets, using as benchmarks the Exponential Smoothing (ETS) family of models, with automatic model selection (via AICc), and Theta, the best performing method on the original M3 competition – a position it held for almost 15 years! In this case both MAPA and Temporal Hierarchies make use of the ETS family of models, so you can get a feeling of the improvement provided by MTA over conventional time series forecasting, as the results are directly comparable with the ETS row.

Tables 3 and 4 provide results for a number of real datasets. Table 4 also provides results on a variety of simulated ARIMA series. The detailed results can be found in the respective papers. In all cases MAPA is better, or at least as good, compared to the various benchmarks. Table 5 provides results on real series that have promoted periods. There are two comparisons: forecasts without and with promotional information. In both cases MTA based forecasts (MAPA) are on average the most accurate.

Table 2. sMAPE results on M3 quarterly and monthly data1
Forecast Quarterly set Monthly set
Exponential Smoothing (ETS) 9.94% 14.45%
Theta (M3 competition)2 8.96% 13.85%
MAPA (Kourentzes et al. 2014) 9.58% 13.69%
Temporal Hierarchies (Athanasopoulos et al. 2017) 9.70% 13.61%

1 Papers provide results on more robust metrics!
2 Best performance in the original M3 competition.

Table 3. Scaled RMSE results on Fast Moving Consumer Goods sales (Barrow and Kourentzes, 2016)
Forecast 1-step ahead 3-steps ahead 5-steps ahead
Naive 0.882 0.900 0.919
ETS 0.677 0.688 0.711
AR 0.707 0.719 0.737
ARIMA 1.446 0.701 0.721
Theta 0.674 0.685 0.705
MAPA 0.668 0.670 0.687
Table 4. Average Relative MAE on simulated and real data (Kourentzes et al., 2017)
Forecast Simulated ARIMA Manaufacturing Call centre
Single Exponential Smoothing (SES) 1.000 1.000 1.000
Exponential Smoothing (ETS) 0.985 1.011 1.005
Optimal Temporal Aggregation & SES 0.974 0.999 1.080
MAPA 0.971 0.994 0.979
Table 5. Scaled MAE results on SKUs with promotions (Kourentzes and Petropoulos, 2016)
Forecast 4-steps ahead 8-steps ahead 12-steps ahead
Naive 0.743 0.818 0.704
ETS 0.704 0.774 0.701
MAPA 0.679 0.754 0.736
Regression + Promotional 0.611 0.659 0.714
ETS + Promotional 0.642 0.627 0.543
MAPA + Promotional 0.525 0.521 0.515

The main argument in all papers is that MTA helps to improve forecast accuracy due to the way it mitigates modelling uncertainty. As we will see this comes at no additional data cost and relatively limited additional computations. An added benefit, which is not very evident from the summarised tables provided here, is that the MTA forecasts are reliable both for short and long term forecasting, providing a way to reconcile operational, tactical and strategic planning.

Unpublished results on different applications provide a similar picture in terms of accuracy. There is also evidence that MTA can strengthen statistical tests, as the initial results of this experiment show. However, all this is ongoing research, so until a full analysis is conducted and the results are peer reviewed, I would add a pinch of salt to these!

In following blog posts I will explain how MTA works and elaborate more on results from the various papers.

Multiple Temporal Aggregation: the story so far: Part I; Part II.

Demand forecasting by temporal aggregation: using optimal or multiple aggregation levels?

N. Kourentzes, B. Rostami-Tabar and D.K. Barrow, 2017, Journal of Business Research.

Recent advances have demonstrated the benefits of temporal aggregation for demand forecasting, including increased accuracy, improved stock control and reduced modelling uncertainty. With temporal aggregation a series is transformed, strengthening or attenuating different elements and thereby enabling better identification of the time series structure. Two different schools of thought have emerged. The first focuses on identifying a single optimal temporal aggregation level at which a forecasting model maximises its accuracy. In contrast, the second approach fits multiple models at multiple levels, each capable of capturing different features of the data. Both approaches have their merits, but so far they have been investigated in isolation. We compare and contrast them from a theoretical and an empirical perspective, discussing the merits of each, comparing the realised accuracy gains under different experimental setups, as well as the implications for business practice. We provide suggestions when to use each for maximising demand forecasting gains.

Download paper.

R package (MAPA). Code for optimal aggregation level: function get.opt.k in TStools package.

Can you spot trend in time series?

Past experiments have demonstrated that humans (with or without formal training) are quite good at visually identifying the structure of time series. Trend is a key component, and arguably the most relevant to practice, as many of the forecasts that affect our lives have to do with potential increases or decreases of economic variables. Forecasters and econometricians often rely on formal statistical tests, while practitioners typically use their intuition, to assess whether a time series exhibits trend. There is fairly limited research contrasting these two. Furthermore, sometimes trend is understood with a rather vague definition. I do not think it is an exaggeration to suggest that even experts often can be confused on the exact definition (and effect) of a unit root and a trend.

To understand more about this, I set up a simple experiment to collect evidence how humans perceive trends. The experiment below asks you to distinguish between trended and non-trended time series. Every 10 time series that you will assess it will provide you with some statistics on your accuracy and the accuracy of some statistical tests (by no means an exhaustive list!). It also provides overall statistics from all participants so far. As you can see, it is no so trivial to identify correctly the presence of trend! What do you think, can you better than the average performance so far?

Forecasting with Temporal Hierarchies

G. Athanasopoulos, R. J. Hyndman, N. Kourentzes and F. Petropoulos, 2017, European Journal of Operational Research.

This paper introduces the concept of Temporal Hierarchies for time series forecasting. A temporal hierarchy can be constructed for any time series by means of non-overlapping temporal aggregation. Predictions constructed at all aggregation levels are combined with the proposed framework to result in temporally reconciled, accurate and robust forecasts. The implied combination mitigates modelling uncertainty, while the reconciled nature of the forecasts results in a unified prediction that supports aligned decisions at different planning horizons: from short-term operational up to long-term strategic planning. The proposed methodology is independent of forecasting models. It can embed high level managerial forecasts that incorporate complex and unstructured information with lower level statistical forecasts. Our results show that forecasting with temporal hierarchies increases accuracy over conventional forecasting, particularly under increased modelling uncertainty. We discuss organisational implications of the temporally reconciled forecasts using a case study of Accident & Emergency departments.

Download paper.

R package (thief).

Forecasting time series with neural networks in R

I have been looking for a package to do time series modelling in R with neural networks for quite some time with limited success. The only implementation I am aware of that takes care of autoregressive lags in a user-friendly way is the nnetar function in the forecast package, written by Rob Hyndman. In my view there is space for a more flexible implementation, so I decided to write a few functions for that purpose. For now these are included in the TStools package that is available in GitHub, but when I am happy with their performance and flexibility I will put them in a package of their own.

Here I will provide a quick overview of what these is available right now. I plan to write a more detailed post about these functions when I get the time.

For this example I will model the AirPassengers time series available in R. I have kept the last 24 observations as a test set and will use the rest to fit the neural networks. Currently there are two types of neural network available, both feed-forward: (i) multilayer perceptrons (use function mlp); and extreme learning machines (use function elm).

# Fit MLP <- mlp(

This is the basic command to fit an MLP network to a time series. This will attempt to automatically specify autoregressive inputs and any necessary pre-processing of the time series. With the pre-specified arguments it trains 20 networks which are used to produce an ensemble forecast and a single hidden layer with 5 nodes. You can override any of these settings. The output of print is a summary of the fitted network:

MLP fit with 5 hidden nodes and 20 repetitions.
Series modelled in differences: D1.
Univariate lags: (1,3,4,6,7,8,9,10,12)
Deterministic seasonal dummies included.
Forecast combined using the median operator.
MSE: 6.2011.

As you can see the function determined that level differences are needed to capture the trend. It also selected some autoregressive lags and decided to also use dummy variables for the seasonality. Using plot displays the architecture of the network (Fig. 1).

Fig. 1. Output of plot(

The light red inputs represent the binary dummies used to code seasonality, while the grey ones are autoregressive lags. To produce forecasts you can type:

mlp.frc <- forecast(,h=tst.n)

Fig. 2 shows the ensemble forecast, together with the forecasts of the individual neural networks. You can control the way that forecasts are combined (I recommend using the median or mode operators), as well as the size of the ensemble.

Fig. 2. Output of the plot function for the MLP forecasts.

You can also let it choose the number of hidden nodes. There are various options for that, but all are computationally expensive (I plan to move the base code to CUDA at some point, so that computational cost stops being an issue).

# Fit MLP with automatic hidden layer specification <- mlp(,"valid",hd.max=10)

This will evaluate from 1 up to 10 hidden nodes and pick the best on validation set MSE. You can also use cross-validation (if you have patience…). You can ask it to output the errors for each size:

H.1  0.0083
H.2  0.0066
H.3  0.0065
H.4  0.0066
H.5  0.0071
H.6  0.0074
H.7  0.0061
H.8  0.0076
H.9  0.0083
H.10 0.0076

There are a few experimental options in specifying various aspects of the neural networks, which are not fully documented and is probably best if you stay away from them for now!

ELMs work pretty much in the same way, although for these I have made default the automatic specification of the hidden layer.

# Fit ELM <- elm(

This gives the following network summary:

ELM fit with 100 hidden nodes and 20 repetitions.
Series modelled in differences: D1.
Univariate lags: (1,3,4,6,7,8,9,10,12)
Deterministic seasonal dummies included.
Forecast combined using the median operator.
Output weight estimation using: lasso.
MSE: 83.0044.

I appreciate that using 100 hidden nodes on such a short time series can make some people uneasy, but I am using a shrinkage estimator instead of conventional least squares to estimate the weights, which in fact eliminates most of the connections. This is apparent in the network architecture in Fig. 3. Only the nodes connected with the black lines to the output layer contribute to the forecasts. The remaining connection weights have been shrunk to zero.

Fig. 3. ELM network architecture.

Another nice thing about these functions is that you can call them from the thief package, which implements Temporal Hierarchies forecasting in R. You can do that in the following way:

# Use THieF
mlp.thief <- thief(,h=tst.n,forecastfunction=mlp.thief)

There is a similar function for using ELM networks: elm.thief.

Since for this simple example I kept some test set, I benchmark the forecasts against exponential smoothing:

Method MAE
MLP (5 nodes) 62.471
MLP (auto) 48.234
ELM 48.253
THieF-MLP 45.906
ETS 64.528

Temporal hierarchies, like MAPA, are great for making your forecasts more robust and often more accurate. However, with neural networks the additional computational cost is evident!

These functions are still in development, so the default values may change and there are a few experimental options that may give you good results or not!

Temporal Big Data for Tire Industry Tactical Sales Forecasting

Y.R. Sagaert, E-H.  Aghezzaf, N. Kourentzes and B. Desmet, 2017, Interfaces.

We propose a forecasting method to improve accuracy for tactical sales predictions at a major supplier to the tire industry. This level of forecasting serves as direct input for the demand planning, steering the global supply chain and is typically up to a year ahead. The case company has a product portfolio that is strongly sensitive to external events. Univariate statistical methods, which are common in practice, are unable to anticipate and forecast changes in the market, while human expert forecasts are known to be biased and inconsistent. The proposed method is able to automatically identify key leading indicators that drive sales from a massive set of macro-economic indicators, across different regions and markets and produce accurate forecasts. Our method is able to handle the additional complexity of the short and long term dynamics from the product sales and the external indicators. We find that accuracy is improved by 16.1% over current practice with proportional benefits for the supply chain. Furthermore, our method provides transparency to the market dynamics, allowing managers to better understand the events and economic variables that affect the sales of their products.

Download paper.

Can neural networks predict trended time series?

Yes and… no! First, I should say that I am thinking of the common types of neural networks that are comprised by neurons that use some type of sigmoid transfer function, although the arguments discussed here are applicable to other types of neural networks. Before answering the question, let us first quickly summarise how typical neural networks function. Note that the discussion is done in a time series forecasting context, so some of the arguments here are specific to that and are not relevant to classification tasks!

1. Multilayer Perceptron (MLP) neural networks

MLPs are a basic form of neural networks. Having a good understanding of these can help one understand most types of neural networks, as typically other types are constructed by adding more connections (such as feedbacks or skip-layer/direct connections). Let us assume that we have three different inputs, (X1, X2, X3), which could be different variables or lags of the target variables. A MLP with a single hidden layer, with 5 hidden nodes and a single output layer can be visualised as in Fig. 1.

Fig. 1. MLP with 3 inputs, 5 hidden nodes arranged in a single layer and a single output node.

An input (for example X1) is passed and processed through all 5 hidden nodes (Hi), the results of which are combined in the output (O1). If you prefer, the formula is:
Y_1 = b_0 + \sum_{i=1}^5 { \left( b_i f(a_{0,i} + \sum_{j=1}^3{a_{j,i}X_{j}}) \right) } ,     (1)
where b0 and a0,i are constants and bi and aj,i are weights for each input Xj and hidden node Hi. Looking carefully at either Eq. 1 or Fig. 1 we can observe that each neuron is a conventional regression that passes through a transfer function f() to become nonlinear. The neural network arranges several such neurons in a network, effectively passing the inputs through multiple (typically) nonlinear regressions and combining the results in the output node. This combination of several nonlinear regressions is what gives a neural network each approximation capabilities. With a sufficient number of nodes it is possible to approximate any function to an arbitrary degree of accuracy. Another interesting effect of this is that neural networks can encode multiple events using single binary dummy variables, as this paper demonstrates. We could add several hidden layers, resulting in a precursor to deep-learning. In principle we could make direct connections from the inputs to layers deeper in the network or the output directly (resulting in nonlinear-linear models) or feedback loops (resulting in recurrent networks).

The transfer function f() is typically either the logistic sigmoid or the hyperbolic tangent for regression problems. The output node typically uses a linear transfer function, acting as a conventional linear regression. To really understand how the input values are transformed to the network output, we need to understand how a single neuron functions.

2. Neurons

Consider a neuron as a nonlinear regression of the form (for the example with 3 inputs):

H_i = f(a_{0,i} + \sum_{j=1}^3{a_{j,i}X_{j}}) .     (2)

If f() is the identity function, then (2) becomes a conventional linear regression. If f() is nonlinear then the magic starts! Depending on the values of the weights aj,i and the constant a0,i the behaviour of neuron changes substantially. To better understand this let us take the example of a single input neuron and visualise the different behaviours. In the following interactive example you can choose:

  • the type of transfer function;
  • the values of the input, weight and constant.

The first plot shows the input-output values, the plot of the transfer function and with cyan background the area of values that can be considered by the neuron given selected weight and constant. The second plot provides a view of the neuron function, given the transfer function, weight and constant. Observe that the weight controls the width of the neuron and the constant the location, along the transfer function.

What is quite important to note here is that both logistic sigmoid and hyperbolic tangent squash the input between two values and the output cannot increase or decrease indefinitely, as with the linear. Also the combination of weight and constant can result in different forms of nonlinearities or approximate linear behaviours. As a side note, although I do not see MLP as anything to do with simulating biological networks, the sigmoid-type transfer functions are partly inspired by the stimulated or not states of biological neurons.

By now two things should become evident:

  • The scale of the inputs is very important for neural networks, as very large or small values result in the same constant output, essentially acting at the bounds of the neuron plots above. Although in theory it is possible to achieve the desired scaling using only appropriate weights and constants, training of networks is aided tremendously by scaling the inputs to a reasonable range, often close to [-1,1].
  • With sigmoid type transfer functions it is impossible to achieve an ever increasing/decreasing range of outputs. So for example if we were to use as an input a vector (1, 2, 3, 4, 5, 6, …, n) the output would be squashed between [0, 1] or [-1, 1] depending on the transfer function, irrespectively of how large n is.

Of course, as Eq. (1) suggests, in a neural network the output of a neuron is multiplied by a weight and shifted by a constant, so it is relatively easy to achieve output values much greater than the bounds of a single neuron. Nonetheless, a network will still “saturate” and reach a minimum/maximum value and cannot decrease/increase perpetually, unless non-squashing neurons are used as well (this is for example a case where direct connections to a linear output become useful). An example of this follows.

Suppose we want to predict the future values of a deterministic upward trend with no noise, of the form: yt = xt and xt = (1, 2, 3, 4, …). We scale the observed values between [-1, 1] to facilitate the training of the neural network. We use only 80% of the values for training the network and the remaining 20% to test the performance of the forecasts (test set A). We train a network with 5 logistic hidden nodes and a single linear output. Fig. 2 provides the resulting network with the trained weights and constants.

Fig. 2. Trained network for predicting deterministic trend.

The single input (scaled xt) is fed to all five nodes. Observe that it is multiplied with different weights (black numbers) and shifted by different constants (blue numbers) at each node. When additional inputs are used, the inherent difficulty in interpreting all these weights together, makes neural networks to be considered as black boxes. Fig. 3 provides the observed yt and predicted neural network values. The network is able to provide a very good fit in the training set and for most of test set A, but as the values increase (test set B) we can see that the networks starts to saturate (the individual nodes reach the upper bounds of the values they can output and eventually the whole network) and the predicted trend tapers off. As we saw earlier, each sigmoid-type node has a maximum value it can output.

Fig. 3. Actuals and neural network values for the case of deterministic trend with no noise.

This raises a significant doubt whether neural networks can forecast trended time series, if they are unable to model such an easy case. One would argue that with careful scaling of data (see good fit in test set A) it is possible to predict trends, but that implies that one knows the range that the future values would be in, to accommodate them with appropriate scaling. This information is typically unknown, especially when the trend is stochastic in nature.

3. Forecasting trends

Although forecasting trends is problematic when using raw data, we can pre-process the time series to enable successful modelling. We can remove any trend through differencing. Much like with ARIMA modelling, we overcome the problem of requiring the network to provide ever increasing/decreasing values and therefore we can model such series. For example, considering one of the yearly M3 competition series we can produce the following forecast:

Fig. 4. Neural network forecast for a trending yearly M3 competition series.

Fig. 4 provides the actuals and forecasts after differencing and scaling is applied, the forecast is produced and subsequently differencing and scaling are reversed. However there are some limitations to consider:

  • This approach implies a two stage model, where first zt = yt – yt-1 is constructed and then zt is modelled using neural networks. This imposes a set of modelling constraints that may be inappropriate.
  • The neural network is capable of capturing nonlinearities. However if such nonlinearities are connected to the level, for example multiplicative seasonality, then by using differences we are making it very difficult for the network to approximate the underlying functional form.
  • Differencing implies stochastic trend, which in principle is inappropriate when dealing with deterministic trend.

Therefore, it is fair to say that differencing is useful, but is by no means the only way to deal with trends and surely not always the best option. However, it is useful to understand how sigmoid-type neurons and networks are bound to fail in modelling raw trended time series. There have been several innovations in neural networks for forecasting, but most are bound by this limitation due to the transfer functions used.

So, can neural networks forecast trended time series? Fig. 4 suggests yes, but how to best do it is still an open question. Past research that I have been part of has shown that using differences is reliable and effective (for example see the specifications of neural networks here and here), even though there are unresolved problems with differencing. Surely just expecting the network to “learn” to forecasts trends is not enough.

Congratulations Dr. Svetunkov!

A couple of days ago it was the graduation ceremony for MSc and PhD students at Lancaster University. Ivan Svetunkov, one of my ex-PhD students officially graduated; well done Ivan! In a previous post I described briefly part of his research. He is also working on the excellent smooth package for R. You can find more about his work at his research blog. Also, well done to all the MSc graduates of the Dept. of Management Science, who had to survive my lectures!

The difference between in-sample fit and forecast performance

One of the fundamental differences in conventional model building, for example they way textbooks introduce regression modelling, and forecasting is how the in-sample fit statistics are used. In forecasting our focus is not a good description of the past, but a (hopefully) good prediction of the yet unseen values. One does not necessarily imply the other. A good fit does not necessarily lead to a good forecast, and vice-versa. For example, overfit models will typically have very small in-sample errors, but be terrible at forecasting. An example of the opposite case is shrinkage, where we sacrifice in-sample fit for good generalisation of the model.

A typical premise of forecasting model building is that a good fit in the past should lead to a reasonable forecast in the future. Implicitly we assume that there is some underlying data generating process, which we hopefully approximate well enough, to forecast it in the future. I do not believe that we can identify the true process for real data – I have come to doubt even if a true process exists, though this is a philosophical point. Here I will not make a discussion on how to avoid overfitting (for a related discussion see this post). Instead I will provide a slightly unusual example that I built for my students to exemplify the difference between in-sample fitting and forecasting.

For this example, I will not use a business time series, but music. We can extract from a song its waveform and analyse it as a time series, and even try to forecast it. There is a catch though, there is an inherent causality in music. It typically follows some logical structure, harmony and so on that a standard forecasting model is unaware of. For example, ARIMA or exponential smoothing, will attempt to find structure and repeating patterns in the past and extrapolate them in the future, unaware of any linguistics or musical constructs.

Let us take a song, sample its first 10 seconds, at 11,025 observations per second and fit an adequate ARIMA. Using standard unit root testing and AICc we identify an ARMA(5,0,4) as the best model. Note that the exact model used does not make much difference for this example, and different order ARIMA would provide more or less the same conclusions.

Using this model we forecast the next 5 seconds. We can assess the quality of the model fit and the forecast visually, but also as sound. Does the model fit sound anything like the original song? Does the forecast sound anything like the original song? Here it is a good point to warn you that I have chosen a sufficiently interesting song.

Fig. 1 provides a plot of the 1000 last historical observations and the ARMA fit, as well as the forecast for the next 500 observations, where we can easily evaluate the quality of the model fit.


Fig. 1. Plot of the 1000 last historical observations and model fit, as well as the next 500 forecasted periods.

We can observe that the model fit is relatively good, yet the forecast is very weak. In fact, the forecast quickly reverts to the mean and the prediction intervals “contain” all the song. This already makes it quite clear that a good fit does not mean a good forecast. The difference between the quality of the in-sample fit and the out-of-sample forecast is striking when “heard”.

The original sample:

The model fit:

The two sound samples are sufficiently similar. In fact, one can hear that the model fit is missing some of the higher pitches, which can be also seen in Fig. 1. Nonetheless, the fit sound sample is arguably retaining enough of the original song to be recognisable.

The forecast sounds like:

As seen in Fig. 1. the forecast very quickly becomes a flat line, which results in silence.

It is clear that although the model fit captures adequately the song, the forecast is useless. Why is this the case? The modelling methodology I followed here is geared towards good in-sample fit, rather than forecasting. Arguably the model family I used is also inappropriate. However, neither of these becomes apparent by consulting only the in-sample fit.

Generalising, consulting conventional in-sample statistics such as R2, mean squared error, etc. is not only useless, but also dangerous, when the model building purpose is forecasting. Consulting penalised statistics, such as information criteria (AIC, BIC, etc.) is useful, but again care should be taken. In our example the best model was chosen using AICc and it still did not produce any useful forecast. Business forecasting is not that dissimilar from the example here.

The reason that forecasting fails here is that there was no consideration during model fitting to identify the key drivers of the time series, in this case the musical and linguistic structure (though there are no lyrics in the sample used here). Instead, a generic extrapolative model was used. Business series often have similar complexities, such as promotions, price changes, events, and so on. These should be part of the model building exercise. Simply flagging observations as outlying, or merely identifying and fitting the “best” model from a limited pool of alternatives is a dangerous strategy. What I often argue to my students is that a good modeller should appreciate the problem context and use it! In-sample statistics are woefully inadequate in doing this, especially when forecasting is the objective.

A final note: all rights for the song are with the original holders. I am in no way a composer myself and the song is not my work! Can anyone recognise which song it is?