Author Archives: Nikos

Can you spot a trend in time series?

Past experiments have demonstrated that humans (with or without formal training) are quite good at visually identifying the structure of time series. Trend is a key component, and arguably the most relevant to practice, as many of the forecasts that affect our lives have to do with potential increases or decreases of economic variables. Forecasters and econometricians often rely on formal statistical tests, while practitioners typically use their intuition, to assess whether a time series exhibits a trend. There is fairly limited research contrasting the two approaches. Furthermore, trend is often understood only vaguely. I do not think it is an exaggeration to suggest that even experts can often be confused about the exact definition (and effect) of a unit root and a trend.

To understand more about this, I set up a simple experiment to collect evidence on how humans perceive trends. The experiment below asks you to distinguish between trended and non-trended time series. After every 10 time series that you assess, it provides some statistics on your accuracy and the accuracy of some statistical tests (by no means an exhaustive list!). It also provides overall statistics from all participants so far. As you can see, it is not so trivial to correctly identify the presence of trend! What do you think, can you do better than the average performance so far?
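As an aside, here is one simple illustrative check of the formal kind in R. This is just a regression t-test on the slope, not necessarily one of the tests used in the experiment; unit root tests are another common choice:

y <- as.numeric(AirPassengers)
summary(lm(y ~ seq_along(y)))   # a significant slope suggests a deterministic trend
# Caveat: autocorrelated residuals make this t-test optimistic, which is one
# reason formal trend testing is harder than it looks.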

Forecasting with Temporal Hierarchies

G. Athanasopoulos, R. J. Hyndman, N. Kourentzes and F. Petropoulos, 2017, European Journal of Operational Research. http://doi.org/10.1016/j.ejor.2017.02.046

This paper introduces the concept of Temporal Hierarchies for time series forecasting. A temporal hierarchy can be constructed for any time series by means of non-overlapping temporal aggregation. Predictions constructed at all aggregation levels are combined with the proposed framework to result in temporally reconciled, accurate and robust forecasts. The implied combination mitigates modelling uncertainty, while the reconciled nature of the forecasts results in a unified prediction that supports aligned decisions at different planning horizons: from short-term operational up to long-term strategic planning. The proposed methodology is independent of forecasting models. It can embed high level managerial forecasts that incorporate complex and unstructured information with lower level statistical forecasts. Our results show that forecasting with temporal hierarchies increases accuracy over conventional forecasting, particularly under increased modelling uncertainty. We discuss organisational implications of the temporally reconciled forecasts using a case study of Accident & Emergency departments.

Download paper.

R package (thief).

Forecasting time series with neural networks in R

I have been looking for a package to do time series modelling in R with neural networks for quite some time, with limited success. The only implementation I am aware of that takes care of autoregressive lags in a user-friendly way is the nnetar function in the forecast package, written by Rob Hyndman. In my view there is space for a more flexible implementation, so I decided to write a few functions for that purpose. For now these are included in the TStools package that is available on GitHub, but when I am happy with their performance and flexibility I will put them in a package of their own.

Here I will provide a quick overview of what is available right now. I plan to write a more detailed post about these functions when I get the time.

For this example I will model the AirPassengers time series available in R. I have kept the last 24 observations as a test set and will use the rest to fit the neural networks. Currently there are two types of neural network available, both feed-forward: (i) multilayer perceptrons (use function mlp); and (ii) extreme learning machines (use function elm).
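For reference, a minimal setup along these lines might be as follows (the names y.in and tst.n are the ones used in the snippets below; TStools is assumed to be installed from GitHub):

# Possible setup: split AirPassengers into training and test sets
library(TStools)   # provides mlp and elm
tst.n <- 24        # hold out the last 24 observations
n <- length(AirPassengers)
y.in <- ts(AirPassengers[1:(n-tst.n)], start=start(AirPassengers),
           frequency=frequency(AirPassengers))
y.out <- ts(AirPassengers[(n-tst.n+1):n], end=end(AirPassengers),
            frequency=frequency(AirPassengers))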

# Fit MLP
mlp.fit <- mlp(y.in)
plot(mlp.fit)
print(mlp.fit)

This is the basic command to fit an MLP network to a time series. It will attempt to automatically specify the autoregressive inputs and any necessary pre-processing of the time series. With the default arguments it trains 20 networks, each with a single hidden layer of 5 nodes, and combines their forecasts into an ensemble. You can override any of these settings. The output of print is a summary of the fitted network:

MLP fit with 5 hidden nodes and 20 repetitions.
Series modelled in differences: D1.
Univariate lags: (1,3,4,6,7,8,9,10,12)
Deterministic seasonal dummies included.
Forecast combined using the median operator.
MSE: 6.2011.

As you can see the function determined that level differences are needed to capture the trend. It also selected some autoregressive lags and decided to also use dummy variables for the seasonality. Using plot displays the architecture of the network (Fig. 1).

Fig. 1. Output of plot(mlp.fit).

The light red inputs represent the binary dummies used to code seasonality, while the grey ones are autoregressive lags. To produce forecasts you can type:

mlp.frc <- forecast(mlp.fit,h=tst.n)
plot(mlp.frc)

Fig. 2 shows the ensemble forecast, together with the forecasts of the individual neural networks. You can control the way that forecasts are combined (I recommend using the median or mode operators), as well as the size of the ensemble.

Fig. 2. Output of the plot function for the MLP forecasts.

You can also let it choose the number of hidden nodes. There are various options for that, but all are computationally expensive (I plan to move the base code to CUDA at some point, so that computational cost stops being an issue).

# Fit MLP with automatic hidden layer specification
mlp2.fit <- mlp(y.in,hd.auto.type="valid",hd.max=10)
print(round(mlp2.fit$MSEH,4))

This will evaluate networks with 1 up to 10 hidden nodes and pick the best size according to the validation-set MSE. You can also use cross-validation (if you have patience…). You can ask it to output the errors for each size:

        MSE
H.1  0.0083
H.2  0.0066
H.3  0.0065
H.4  0.0066
H.5  0.0071
H.6  0.0074
H.7  0.0061
H.8  0.0076
H.9  0.0083
H.10 0.0076

There are a few experimental options for specifying various aspects of the neural networks; these are not fully documented and it is probably best to stay away from them for now!

ELMs work pretty much in the same way, although for these the automatic specification of the hidden layer is the default.

# Fit ELM
elm.fit <- elm(y.in)
print(elm.fit)
plot(elm.fit)

This gives the following network summary:

ELM fit with 100 hidden nodes and 20 repetitions.
Series modelled in differences: D1.
Univariate lags: (1,3,4,6,7,8,9,10,12)
Deterministic seasonal dummies included.
Forecast combined using the median operator.
Output weight estimation using: lasso.
MSE: 83.0044.

I appreciate that using 100 hidden nodes on such a short time series can make some people uneasy, but I am using a shrinkage estimator instead of conventional least squares to estimate the weights, which in fact eliminates most of the connections. This is apparent in the network architecture in Fig. 3. Only the nodes connected with the black lines to the output layer contribute to the forecasts. The remaining connection weights have been shrunk to zero.

Fig. 3. ELM network architecture.

Another nice thing about these functions is that you can call them from the thief package, which implements Temporal Hierarchies forecasting in R. You can do that in the following way:

# Use THieF
library(thief)
frc.thief <- thief(y.in,h=tst.n,forecastfunction=mlp.thief)

There is a similar function for using ELM networks: elm.thief.

Since I kept a test set for this simple example, I can benchmark the forecasts against exponential smoothing:

Method          MAE
MLP (5 nodes)   62.471
MLP (auto)      48.234
ELM             48.253
THieF-MLP       45.906
ETS             64.528

Temporal hierarchies, like MAPA, are great for making your forecasts more robust and often more accurate. However, with neural networks the additional computational cost is evident!

These functions are still in development, so the default values may change and there are a few experimental options that may give you good results or not!

Temporal Big Data for Tire Industry Tactical Sales Forecasting

Y.R. Sagaert, E-H. Aghezzaf, N. Kourentzes and B. Desmet, 2017, Interfaces.

We propose a forecasting method to improve accuracy for tactical sales predictions at a major supplier to the tire industry. This level of forecasting, typically up to a year ahead, serves as a direct input for demand planning and steers the global supply chain. The case company has a product portfolio that is strongly sensitive to external events. Univariate statistical methods, which are common in practice, are unable to anticipate and forecast changes in the market, while human expert forecasts are known to be biased and inconsistent. The proposed method is able to automatically identify, from a massive set of macro-economic indicators across different regions and markets, the key leading indicators that drive sales, and to produce accurate forecasts. Our method is able to handle the additional complexity of the short- and long-term dynamics of the product sales and the external indicators. We find that accuracy is improved by 16.1% over current practice, with proportional benefits for the supply chain. Furthermore, our method provides transparency into the market dynamics, allowing managers to better understand the events and economic variables that affect the sales of their products.

Download paper.

Can neural networks predict trended time series?

Yes and… no! First, I should say that I have in mind the common types of neural networks that are composed of neurons using some type of sigmoid transfer function, although the arguments discussed here are applicable to other types of neural networks as well. Before answering the question, let us first quickly summarise how typical neural networks function. Note that the discussion is set in a time series forecasting context, so some of the arguments are specific to that and are not relevant to classification tasks!

1. Multilayer Perceptron (MLP) neural networks

MLPs are a basic form of neural network. Having a good understanding of these can help one understand most types of neural networks, as other types are typically constructed by adding more connections (such as feedbacks or skip-layer/direct connections). Let us assume that we have three different inputs, (X1, X2, X3), which could be different variables or lags of the target variable. An MLP with a single hidden layer of 5 nodes and a single output node can be visualised as in Fig. 1.

Fig. 1. MLP with 3 inputs, 5 hidden nodes arranged in a single layer and a single output node.

An input (for example X1) is passed and processed through all 5 hidden nodes (Hi), the results of which are combined in the output (O1). If you prefer, the formula is:
Y_1 = b_0 + \sum_{i=1}^5 { \left( b_i f(a_{0,i} + \sum_{j=1}^3{a_{j,i}X_{j}}) \right) } ,     (1)
where b0 and a0,i are constants and bi and aj,i are weights for each input Xj and hidden node Hi. Looking carefully at either Eq. (1) or Fig. 1 we can observe that each neuron is a conventional regression passed through a transfer function f() to become nonlinear. The neural network arranges several such neurons in a network, effectively passing the inputs through multiple (typically) nonlinear regressions and combining the results in the output node. This combination of several nonlinear regressions is what gives a neural network its approximation capabilities. With a sufficient number of nodes it is possible to approximate any function to an arbitrary degree of accuracy. Another interesting effect of this is that neural networks can encode multiple events using single binary dummy variables, as this paper demonstrates. We could add several hidden layers, resulting in a precursor to deep learning. In principle we could also make direct connections from the inputs to deeper layers or directly to the output (resulting in nonlinear-linear models), or add feedback loops (resulting in recurrent networks).
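To make Eq. (1) concrete, here is a small sketch of the forward pass of such a network in R (all weights are random and purely illustrative):

# Forward pass of a 3-input, 5-hidden-node MLP, as in Eq. (1)
mlp.forward <- function(X, a0, A, b0, b, f=tanh) {
  H <- f(a0 + as.vector(X %*% A))  # 5 hidden nodes: nonlinear regressions on the inputs
  b0 + sum(b * H)                  # linear output node combines the hidden nodes
}
set.seed(1)
A <- matrix(rnorm(15), nrow=3, ncol=5)  # input-to-hidden weights a_{j,i}
mlp.forward(X=c(0.5,-1,2), a0=rnorm(5), A=A, b0=0, b=rnorm(5))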

The transfer function f() is typically either the logistic sigmoid or the hyperbolic tangent for regression problems. The output node typically uses a linear transfer function, acting as a conventional linear regression. To really understand how the input values are transformed to the network output, we need to understand how a single neuron functions.

2. Neurons

Consider a neuron as a nonlinear regression of the form (for the example with 3 inputs):

H_i = f(a_{0,i} + \sum_{j=1}^3{a_{j,i}X_{j}}) .     (2)

If f() is the identity function, then (2) becomes a conventional linear regression. If f() is nonlinear then the magic starts! Depending on the values of the weights aj,i and the constant a0,i, the behaviour of the neuron changes substantially. To better understand this, let us take the example of a single-input neuron and visualise the different behaviours. In the following interactive example you can choose:

  • the type of transfer function;
  • the values of the input, weight and constant.

The first plot shows the input-output values, the plot of the transfer function and, with a cyan background, the range of values that the neuron can consider given the selected weight and constant. The second plot provides a view of the neuron function, given the transfer function, weight and constant. Observe that the weight controls the width of the neuron and the constant its location along the transfer function.

What is quite important to note here is that both the logistic sigmoid and the hyperbolic tangent squash the input between two values, so the output cannot increase or decrease indefinitely, as it can with the linear transfer function. Also, the combination of weight and constant can result in different forms of nonlinearity or approximate linear behaviour. As a side note, although I do not see MLPs as having much to do with simulating biological networks, the sigmoid-type transfer functions are partly inspired by the stimulated or unstimulated states of biological neurons.
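A quick static way to explore this in R (a logistic neuron, with the weight a1 and the constant a0 as the knobs; purely illustrative):

# A single logistic neuron: the weight scales, the constant shifts
neuron <- function(x, a0, a1) 1/(1+exp(-(a0+a1*x)))
x <- seq(-10,10,length.out=200)
plot(x, neuron(x,a0=0,a1=1), type="l", ylab="output")  # baseline sigmoid
lines(x, neuron(x,a0=3,a1=1), col="blue")              # the constant shifts the location
lines(x, neuron(x,a0=0,a1=4), col="red")               # the weight narrows the transition
neuron(c(10,100,1000), a0=0, a1=1)                     # saturates at 1, however large x gets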

By now two things should become evident:

  • The scale of the inputs is very important for neural networks, as very large or small values result in the same constant output, essentially acting at the bounds of the neuron plots above. Although in theory it is possible to achieve the desired scaling using only appropriate weights and constants, training of networks is aided tremendously by scaling the inputs to a reasonable range, often close to [-1,1].
  • With sigmoid-type transfer functions it is impossible to achieve an ever increasing/decreasing range of outputs. So for example if we were to use as an input a vector (1, 2, 3, 4, 5, 6, …, n), the output would be squashed between [0, 1] or [-1, 1] depending on the transfer function, irrespective of how large n is.

Of course, as Eq. (1) suggests, in a neural network the output of a neuron is multiplied by a weight and shifted by a constant, so it is relatively easy to achieve output values much greater than the bounds of a single neuron. Nonetheless, a network will still “saturate” and reach a minimum/maximum value and cannot decrease/increase perpetually, unless non-squashing neurons are used as well (this is for example a case where direct connections to a linear output become useful). An example of this follows.

Suppose we want to predict the future values of a deterministic upward trend with no noise, of the form: yt = xt and xt = (1, 2, 3, 4, …). We scale the observed values between [-1, 1] to facilitate the training of the neural network. We use only 80% of the values for training the network and the remaining 20% to test the performance of the forecasts (test set A). We train a network with 5 logistic hidden nodes and a single linear output. Fig. 2 provides the resulting network with the trained weights and constants.

Fig. 2. Trained network for predicting deterministic trend.

The single input (scaled xt) is fed to all five nodes. Observe that it is multiplied by different weights (black numbers) and shifted by different constants (blue numbers) at each node. When additional inputs are used, the inherent difficulty of interpreting all these weights together is what makes neural networks be regarded as black boxes. Fig. 3 provides the observed yt and predicted neural network values. The network is able to provide a very good fit in the training set and for most of test set A, but as the values increase (test set B) we can see that the network starts to saturate (the individual nodes reach the upper bounds of the values they can output, and eventually so does the whole network) and the predicted trend tapers off. As we saw earlier, each sigmoid-type node has a maximum value it can output.

Fig. 3. Actuals and neural network values for the case of deterministic trend with no noise.

This raises significant doubt as to whether neural networks can forecast trended time series, if they are unable to model such an easy case. One could argue that with careful scaling of the data (see the good fit in test set A) it is possible to predict trends, but that implies that one knows the range the future values will fall in, so as to accommodate them with appropriate scaling. This information is typically unknown, especially when the trend is stochastic in nature.
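For the curious, the experiment can be roughly reconstructed with the mlp function from TStools. This is a sketch under assumptions: hd and difforder are taken to be arguments of the current implementation, and fixing difforder=0 switches off the automatic differencing that would otherwise hide the effect:

# Sketch: fit an MLP to a noiseless deterministic trend without differencing
library(TStools)
x <- ts(1:100)                                   # y_t = x_t = (1, 2, 3, ...)
fit <- mlp(window(x,end=80), hd=5, difforder=0)  # 5 hidden nodes, no differencing
plot(forecast(fit,h=20))                         # the forecast tapers off as the nodes saturate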

3. Forecasting trends

Although forecasting trends is problematic when using raw data, we can pre-process the time series to enable successful modelling. We can remove any trend through differencing. Much like with ARIMA modelling, we overcome the problem of requiring the network to provide ever increasing/decreasing values and therefore we can model such series. For example, considering one of the yearly M3 competition series we can produce the following forecast:

Fig. 4. Neural network forecast for a trending yearly M3 competition series.

Fig. 4 provides the actuals and forecasts: differencing and scaling are first applied, the forecast is produced, and subsequently differencing and scaling are reversed (a minimal sketch of this two-stage approach follows the list below). However, there are some limitations to consider:

  • This approach implies a two-stage model, where first zt = yt – yt-1 is constructed and then zt is modelled using neural networks. This imposes a set of modelling constraints that may be inappropriate.
  • The neural network is capable of capturing nonlinearities. However, if such nonlinearities are connected to the level, for example multiplicative seasonality, then by using differences we make it very difficult for the network to approximate the underlying functional form.
  • Differencing implies a stochastic trend, which in principle is inappropriate when dealing with a deterministic trend.
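Here is the promised minimal sketch of the two-stage approach, reusing y.in and tst.n from the earlier AirPassengers example (difforder=0 is an assumed argument used to stop a second round of differencing; note that the mlp function already does all of this internally whenever it reports "Series modelled in differences"):

# Two-stage modelling: difference, model, forecast, integrate back
z <- diff(y.in)                        # z_t = y_t - y_(t-1) removes the trend
fit.z <- mlp(z, difforder=0)           # model the differenced series directly
frc.z <- forecast(fit.z, h=tst.n)      # forecast in differences
frc <- cumsum(frc.z$mean) + tail(y.in,1)[1]  # reverse the differencing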

Therefore, it is fair to say that differencing is useful, but it is by no means the only way to deal with trends, and surely not always the best option. Nonetheless, it is useful to understand why sigmoid-type neurons and networks are bound to fail in modelling raw trended time series. There have been several innovations in neural networks for forecasting, but most are bound by this limitation due to the transfer functions used.

So, can neural networks forecast trended time series? Fig. 4 suggests yes, but how best to do it is still an open question. Past research that I have been part of has shown that using differences is reliable and effective (for example, see the specifications of the neural networks here and here), even though there are unresolved problems with differencing. Surely just expecting the network to "learn" to forecast trends is not enough.

Congratulations Dr. Svetunkov!

A couple of days ago was the graduation ceremony for MSc and PhD students at Lancaster University. Ivan Svetunkov, one of my ex-PhD students, officially graduated; well done Ivan! In a previous post I briefly described part of his research. He is also working on the excellent smooth package for R. You can find more about his work at his research blog. Also, well done to all the MSc graduates of the Dept. of Management Science, who had to survive my lectures!

The difference between in-sample fit and forecast performance

One of the fundamental differences between conventional model building, for example the way textbooks introduce regression modelling, and forecasting is how the in-sample fit statistics are used. In forecasting our focus is not a good description of the past, but a (hopefully) good prediction of the yet unseen values. One does not necessarily imply the other. A good fit does not necessarily lead to a good forecast, and vice versa. For example, overfit models will typically have very small in-sample errors, but be terrible at forecasting. An example of the opposite case is shrinkage, where we sacrifice in-sample fit for good generalisation of the model.

A typical premise of forecasting model building is that a good fit in the past should lead to a reasonable forecast in the future. Implicitly we assume that there is some underlying data generating process, which we hopefully approximate well enough to forecast it in the future. I do not believe that we can identify the true process for real data – I have come to doubt whether a true process even exists, though this is a philosophical point. I will not discuss here how to avoid overfitting (for a related discussion see this post). Instead I will provide a slightly unusual example that I built for my students to illustrate the difference between in-sample fitting and forecasting.

For this example, I will not use a business time series, but music. We can extract a song's waveform and analyse it as a time series, and even try to forecast it. There is a catch though: music has an inherent structure. It typically follows some logical form, harmony and so on, of which a standard forecasting model is unaware. For example, ARIMA or exponential smoothing will attempt to find structure and repeating patterns in the past and extrapolate them into the future, oblivious to any linguistic or musical constructs.

Let us take a song, sample its first 10 seconds at 11,025 observations per second, and fit an adequate ARIMA. Using standard unit root testing and the AICc we identify an ARIMA(5,0,4) as the best model. Note that the exact model used does not make much difference for this example; a different-order ARIMA would provide more or less the same conclusions.
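A rough reconstruction of the setup might look as follows (the file name is a placeholder, the file is assumed to already be sampled at 11,025 Hz, and the tuneR and forecast packages are assumed available):

# Read the waveform, fit an ARIMA, forecast the next 5 seconds
library(tuneR)     # readWave
library(forecast)  # auto.arima, forecast
wav <- readWave("song.wav")        # placeholder file name
y <- ts(wav@left[1:(10*11025)])    # first 10 seconds at 11,025 observations/second
fit <- auto.arima(y, max.p=5, max.q=5, seasonal=FALSE)  # AICc-based selection
frc <- forecast(fit, h=5*11025)    # forecast the next 5 seconds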

Using this model we forecast the next 5 seconds. We can assess the quality of the model fit and the forecast visually, but also as sound. Does the model fit sound anything like the original song? Does the forecast? Here it is worth warning you that I have chosen a sufficiently interesting song.

Fig. 1 provides a plot of the 1000 last historical observations and the ARMA fit, as well as the forecast for the next 500 observations, where we can easily evaluate the quality of the model fit.


Fig. 1. Plot of the 1000 last historical observations and model fit, as well as the next 500 forecasted periods.

We can observe that the model fit is relatively good, yet the forecast is very weak. In fact, the forecast quickly reverts to the mean and the prediction intervals "contain" practically all of the song. This already makes it quite clear that a good fit does not imply a good forecast. The difference between the quality of the in-sample fit and the out-of-sample forecast is striking when "heard".

The original sample:

The model fit:

The two sound samples are quite similar, although one can hear that the model fit misses some of the higher pitches, which can also be seen in Fig. 1. Nonetheless, the fitted sound sample arguably retains enough of the original song to be recognisable.

The forecast sounds like:

As seen in Fig. 1, the forecast very quickly becomes a flat line, which results in silence.

It is clear that although the model fit adequately captures the song, the forecast is useless. Why is this the case? The modelling methodology I followed here is geared towards good in-sample fit, rather than forecasting. Arguably the model family I used is also inappropriate. However, neither of these issues becomes apparent by consulting only the in-sample fit.

Generalising, consulting conventional in-sample statistics such as the R2, mean squared error, etc. is not only useless, but also dangerous, when the purpose of model building is forecasting. Consulting penalised statistics, such as information criteria (AIC, BIC, etc.), is useful, but again care should be taken: in our example the best model was chosen using the AICc and it still did not produce any useful forecast. Business forecasting is not that dissimilar from the example here.

The reason that forecasting fails here is that there was no consideration during model building of the key drivers of the time series, in this case the musical and linguistic structure (though there are no lyrics in the sample used here). Instead, a generic extrapolative model was used. Business series often have similar complexities, such as promotions, price changes, events, and so on. These should be part of the model building exercise. Simply flagging observations as outlying, or merely identifying and fitting the "best" model from a limited pool of alternatives, is a dangerous strategy. What I often argue to my students is that a good modeller should appreciate the problem context and use it! In-sample statistics are woefully inadequate for doing this, especially when forecasting is the objective.

A final note: all rights for the song are with the original holders. I am in no way a composer myself and the song is not my work! Can anyone recognise which song it is?

Talk on 'Advances in promotional modelling and demand forecasting'

On 16/11/2016 I gave a talk at the Stockholm School of Economics on the topic of advances in promotional modelling and demand forecasting. Given the diversity of the audience I avoided going into the details of the mathematical formulations, some of which can be found in the appendix of the presentation. The presentation summarises three different research areas I have been active in recently, with the help of colleagues and PhD students:

  • temporal aggregation and temporal hierarchies;
  • promotional modelling at SKU level;
  • use of online search data for lifecycle modelling.

You can download the slides here.

MAPAx available for R & new MAPA package on CRAN

The previous version of the MAPA package implemented only the univariate side of the algorithm. Version 2.0.1 implements MAPAx as well, which allows incorporating regressors in your forecasts. In this paper we demonstrated the usefulness of temporal aggregation for forecasting demand in the presence of promotions. In particular, we showed that MAPAx substantially outperformed promotional regression models (of this type), as well as exponential smoothing with promotional information (see Fig. 1).


Fig. 1. Bias and accuracy results for heavily promoted items. Description and details of the case study and experiment can be found here.

To use MAPAx, call the usual mapa functions with the addition of the argument xreg. You can also control whether the regressors are transformed using principal components, with the argument pr.comp. By default no transformation is performed. Note that MAPAx can only be used when type="es". When the regressors are related to high-frequency information, such as promotions, it is recommended to use the combination options (argument comb) w.mean or w.median, which weight the high-frequency seasonal and xreg states more heavily.
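For example, a call could look as follows (sales and promo are hypothetical names: a weekly demand series and a matrix of promotional dummies, invented for illustration):

# MAPAx: MAPA with exogenous regressors (sketch)
library(MAPA)
frc <- mapa(sales, ppy=52, fh=13, type="es",  # MAPAx requires type="es"
            xreg=promo,                       # matrix of regressors
            comb="w.mean")                    # weight seasonal and xreg states more heavily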

The following example illustrates the strength of MAPAx when additional information is available. We model the demand of a promoted product with conventional MAPA and MAPAx. Fig. 2 provides the in-sample rolling forecasts and the out-of-sample forecast for 13 weeks.


Fig. 2. MAPA and MAPAx forecasts for the demand of a promoted product.

Observe that MAPAx models the past and future promotions and also provides tighter prediction intervals, since more of the information in past sales is captured.

The stable version of the package with MAPAx is 2.0.1 and is available on CRAN. The development version, which includes the latest bugfixes, can be found on GitHub, where I also list the version numbers available on CRAN and GitHub, so you can check whether you have the latest one. MAPAx, similar to high-frequency MAPA, requires the smooth package. Make sure you have the latest version of that as well!

MAPA package for R on GitHub

Here is the link: https://github.com/trnnick/mapa

For a long time I have wanted to rework the MAPA package for R, but I could not find the time. Finally I got around to starting it. There are three objectives:

  1. Clean up the code and introduce S3 methods. MAPA was the first package that I wrote!
  2. Incorporate the es exponential smoothing implementation from the smooth package. This is an alternative to ets that will remain as the default. It allows longer seasonal cycles and some additional customisation in the specification of the exponential smoothing models.
  3. Implement MAPAx, which permits introducing exogenous variables to MAPA.

Tidying up the code is an ever-ongoing process, so (1) will never finish! I have already sorted out (2) and the next task is to work on (3). You can grab the development version on GitHub. At the time of writing you will need the development version of smooth for this to work, on which Ivan has been ironing out a few peculiar bugs that MAPA brought to the surface. You would be amazed how much temporal aggregation can mess up otherwise stable code!

Here is a sneak peek at things you could not do with MAPA before:

mapafit <- mapaest(taylor,ppy=336,type="es")
plot(mapafit)

Fig. 1: Look at all these aggregation levels!

The new argument is type="es", which tells MAPA which exponential smoothing implementation to use. Once the estimation is done we can use it to forecast as usual.

mapacalc(taylor,mapafit,comb="wght",outplot=2)

Fig. 2: MAPA for high frequency data!

So finally we can now easily use MAPA for series with seasonal cycles longer than 24 periods! You may have observed that I am using a different combination of the temporal aggregation states: wght. This is still experimental, so use it at your own risk! I would still recommend using either median or mean.

Another nice feature that the es implementation offers is more flexibility in specifying models. Before, we could use model="ZZZ" to let ets select the best model for each aggregation level. Now we can use the letters X and Y to restrict the search to only additive or only multiplicative options (including N), respectively. So for example we could set model="ZXA" to have MAPA look for exponential smoothing models with any type of error, a trend that can be none, additive or damped additive, and additive seasonality.
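In code, the example just described would be:

# Any error; no, additive or damped additive trend; additive seasonality
mapafit <- mapaest(taylor,ppy=336,type="es",model="ZXA")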

Which of ets or es is more accurate to use with MAPA? This is a difficult question to answer. In different examples I tried I got different results, but there are a few things I can point out. ets from the forecast package by default does not allow multiplicative trends, as using only additive trends seems to work better for ets. This is not the case for es, which considers all possible trends. In the current MAPA implementation you can use the argument allow.multiplicative.trend=TRUE to make ets consider all trends. Similarly, with es you can use model="ZXZ" to restrict it to additive trends. On the M3 dataset, once the specification is identical (only additive trends), es-based MAPA performs about 5% worse than ets-based MAPA, which is a fairly small difference, with no evidence of statistically significant differences. So my recommendation is as follows: MAPA with ets works fine and is the default core, but if the additional flexibility of es is needed, feel confident in using it!