Author Archives: Nikos

Towards the “one-number forecast”

1. Introductory remarks

One of the recurrent topics in online discussions on sales forecasting and demand planning is the idea of the “one-number forecast”, that is a common view of the future on which multiple plans and decisions can be made, from different functions of an organisation. In principle, this is yet another idea around the notion that we must break information silos within organisations. This topic has been hotly debated in practice, but academia has been somewhat silent. There are good reasons for this. I would argue that a key to this is that many colleagues in forecasting, econometrics and machine learning are predominantly focused on algorithmic and modelling research questions, rather than the organisational context within which forecasts are generated. This naturally results in different research focus. Given my background in strategic management, I always like to ponder on the organisational aspects of forecasting – even though I admit that most of my research revolves around algorithmic and modelling questions!

Over the years I have written a lot about the benefits of multiple temporal aggregation, either in the form of MAPA or Temporal Hierarchies, in terms of achieving both accuracy gains and importantly alignment between short- and long-term forecasts, as well as allowing operational information to pass on seamlessly to strategic decision makers and vice-versa. Yet, this still leaves the cross-sectional aspect of the forecasting problem (for example, different product categories, market segments, etc.) somewhat disconnected. Keeping these two streams of forecasting research disconnected has left the so-called one-number forecast beyond our modelling capabilities and to the sphere of organisational and process design and management (see S&OP, IBP, or other various names of the idea that people should… talk – hint: works in every aspect of life!).

Over the years, with colleagues, I have approached the problem from various aspects, but I think I finally have a practical solution to this that I am happy to recommend!

2. Why bother?

So what is the limitation of traditional forecasting? Why do we need this more integrated thinking?

  • Forecasts build for different functions/decisions are typically based on different information and therefore these are bound to differ and ignore much of the potentially available information.
  • Statistically speaking, given different target forecast horizons or different input information, typically a different model is more appropriate. The resulting forecasts are bound to differ as well.
  • Different functions/decisions need forecasts at different frequencies. We need to account for different decision-making frequency and speed for operational decisions (for example, inventory management) and different for tactical/strategic (for example, location and capacity of a new warehouse).
  • Forecasts that differ will provide misaligned decisions, which will result in organisational friction, additional costs and lost opportunities.
  • Different forecasts give a lot of space for organisational politics: my forecast is better than yours! This is often resolved top-down, which eliminates important information that in principle is available to the organisations. Organisational politics and frictions are a leading reason for silos.
  • A quite simple argument: if you have many different forecasts about the same thing that do not agree, most, if not all, are wrong. (Yes statistically speaking all forecasts are wrong, but practically speaking many are just fine and safe to use!).

What is wrong with using cross-sectional hierarchical forecasting (bottom-up, top-down, etc.) to merge forecasts together?

  • First, none of bottom-up, top-down or middle-out are the way to go. The optimal combination (or MinT) methodology is more meaningful and eliminates the need for a modelling choice that is not grounded on any theoretical understanding of the forecasting problem.
  • Cross-sectional hierarchical forecasting can indeed provide aggregate coherent (that is forecasts on different levels, such as SKUs sales and product category sales, that add up perfectly), but they do so for a single time instance. Let’s make this practical. Having coherent forecasts at the bottom level, where say weekly forecasts of SKUs are available is meaningful. As we go to higher levels of the hierarchy, is there any value on weekly total sales of the company? More importantly, apart from the statistical convenience of such as forecast, is there any meaningful information that senior management can add on a weekly basis (or would they bother?).

What is wrong with using temporal hierarchical forecasting to merge forecasts together?

  • This is the complimentary problem of cross-sectional forecasting. Now we have the temporal dynamics captured. We merge together information that is relevant for short-term forecasting, but also long-term, to gain benefits in both. However, SKU sales on weekly frequency are useful, but SKU annual sales not so. Probably there you need product group of even more aggregate figures at annual buckets of sales.

What is wrong with building super-models that get all the information in one go and produce outputs for everything?

  • That should sound dodgy! If this was a thing, then this blog wouldn’t really exist…
  • On a more serious note, statistically speaking this is a very challenging problem, both in terms of putting down the equations for such a model, but also estimating meaningful parameters. Economics has failed repeatedly in doing this for macro-economy and there are well understood and good reasons why our current statistical and mathematical tools fail at that. I underline current because research is ongoing!
  • From an organisational point of view, that would require a data integration maturity, as well as almost no silos between functions and teams, so as to be able to get all the different sources of data in the model in a continuous and reliable form. Again, this is not theoretically impossible, but my experience in working with some of the leading companies in various sectors is that we are not there yet.

So getting true one-number forecasts is more difficult that one would like. Does it worth the effort?

  • If different functions/decision makers have the same view about the future, they are better informed and naturally will tend to take more aligned decisions, with all the organisational and financial benefits.
  • If such forecasts were possible, it would enable overcoming many of the organisational silos in a data-driven way. Between innovating human organisations and behaviours or statistics, it is somewhat easy to guess which one is easier!

3. How to build one-number forecasts?

Let me say upfront that:

  • It is all about bringing together the cross-sectional and temporal hierarchies. There is a benefit to this: both are mature enough to offer substantial modelling flexibility and therefore they do not restrict our forecasting toolbox, including statistics, machine learning, managerial expertise or “expert” judgement.
  • It can be done in a modular fashion, so forecasts do not need to be fully blended from the onset, but as the organisation gains more maturity, then more functions can contribute to the one-number forecast, so that we move towards the ultimate goal in practical and feasible steps.
  • Therefore, what follows can be implemented within the existing machinery of business forecasting (please don’t ask me how to do this in Excel! It can be done, but why?).
  • For anyone interested, this is the relevant paper (and references within), but quite readily I admit that papers are often not written to be… well, accessible. I hope that readers of my academic work will at least feel that I try to put some effort to make my work accessible to varying degrees of success!

3.1. The hierarchical forecasting machinery

We need to start with the basics of how to blend forecasts for different items together. Suppose we have a hierarchy of products, such as the one in Figure 1. This is a fairly generic hierarchy that we could imagine it describes sales of SKUs XX, XY and YX, YY and YZ, which can be grouped together in product groups X and Y, which in turn can be aggregated to Total sales. This hierarchy also implies that there are some coherence constraints: Total = X + Y, X = XX + XY, Y = YX + YY + YZ. This is surely true for past sales and it should be true for any forecasts.

Figure 1. Total sales can be broken into product groups X and Y, which in turn contain SKUs XX, XY and YX, YY and YX.

This restriction, that the coherence should be true for forecasts, is very helpful in giving us a mechanism to blend forecasts. This is not a top-down or bottom-up problem. The reason is that each level is relevant to different parts of an organisation and they have different information available. SKU level sees the detailed demand and interaction with customers, on high frequency. This is very relevant, for example, for inventory management. Product/brand level sees the aggregate demand patterns, the perception of a brand, etc., which is very relevant, for example, to marketing. Budgeting and financial operations would be very interested in the total level. All these different functions will most probably have different information sets available and should be using different models (or “expert guessing”) to build forecasts. Primarily, as these models are required to give different types of outputs, for different time scales. For example, inventory planners need forecasts and uncertainties to plan safety stocks and orders, typically for short horizons. Marketing needs elasticities of promotions and pricing and potentially longer-term forecasts. Financial operations even longer horizons and forecasts expressed in monetary terms, rather than product units. Therefore, it is not just about making numbers match, but it is about bringing different organisational views together. Top-down and bottom-up fail completely on this aspect.

So how are we to bring different views together? Let us abstract the problem a bit (because using X, XX, and so on, was very specific for my taste!). The hierarchy in Figure 1 can be written as the following matrix S.

The structure of S codifies the hierarchy. Each column corresponds to a bottom level time series and each row to a node/level of the hierarchy. We place 1 when a bottom-level (column) element contributes to that level and 0 otherwise. With this structure, if one would take the SKU sales and pass them through this S (for summing) matrix, then the outcome would be the sales for all levels of the hierarchy (rows).

If instead of sales we had forecasts for the bottom level, using S we can produce bottom-up forecasts for the complete hierarchy. Likewise, if we had forecasts for only the top level (Total) then we could use the mapping in S to come up with a way to disaggregate forecasts to the lower levels. A couple of paragraphs above I argued that we need forecasts at all levels. If we do that, the same S we help us understand how much our forecast disagree: how decoherent they are. Skipping the mathematical derivations, it has been shown that the following equation can take any raw decoherent forecasts and reconcile them, by attempting to minimise the reconciliation errors, that is how much the forecasts disagree.

I am using matrix notation to avoid writing massive messy formulas. What the above says is: give me all your initial forecasts and I will multiply them with a matrix G that contains some combination weights, the S matrix that maps the hierarchy and I will give you back coherent forecasts. This is fairly easy if G is known. Before we go into the estimation of G there are some points useful to stress, which are typically not given enough attention in hierarchical forecasting:

  • Hierarchical forecasting is merely a forecast combination exercise, where we linearly combine (independent) forecasts of different levels.
  • Combinations of forecasts are desirable. Statistically, they typically lead to more accurate forecasts (this is why hierarchical forecasting often relates to accuracy gains), but also substantially mitigates the model selection problem, as it is okay to get some models wrong.
  • That the forecasts can be independent is a tremendous advantage for practice. At each node/level we can produce forecasts separately, based on different information and forecasting techniques, matching the requirements as needed. Statistically speaking, independent forecasts may not be theoretically elegant, but they are certainly much simpler to specify and estimate, so quite useful for practice!
  • There is no need to aggregate/disaggregate. Hierarchical forecasting directly produces forecasts for all levels.

Let us return to the estimation of G. The formula for this is:

That means that G is dependent on the map of the hierarchy in S and the forecast errors. Estimating W is not straightforward (for details and example see this paper, section 2), but suffice to say that it accounts for the forecast errors, or in other words the quality of the forecast at each node. In a nutshell, poor forecasts will be given less weight than better forecasts. Consider the following: if all forecasts were perfect, then they would be coherent and no need to reconcile the forecasts. Of course, in practice they are not, and we prefer to adjust more the inaccurate forecasts, as the chance is that they are probably more responsible for the lack of coherence.

3.2. Cross-sectional hierarchical forecasting

This is the standard form of hierarchical forecasting that most people relate to through the bottom-up and top-down forecasting logic. In this case, G plays the role of aggregation or disaggregation weights. I am hesitant to use the forecast combination logic in this case, as we do not use forecasts from different levels of the hierarchy, but we combine forecasts from a single level only. This increases the forecasting risk, as we rely on fewer models, forecasts and less information.

Using the machinery described above one could forecast each node in the hierarchy independently (the machinery does not preclude using models capable of producing forecasts for multiple nodes simultaneous). For example, at the very disaggregate level, one might use intermittent demand forecasting methods, or other automatic univariate forecasting models, such as exponential smoothing. The reason being is that in practice the bottom levels are very large, typically containing (many) thousands of time series, which need to be forecasted very frequently. Here automation and reliability are essential. At higher levels, explanatory variables may be more relevant. For instance, one could use at the higher levels of the hierarchy leading indicators from the macro-economic environment to augment the forecasts. Also at that level, there are fewer forecasts to be made, so human experts can intervene more easily. Such information may be difficult to connect to the time series at the lower levels of the hierarchy.

Using cross-sectional hierarchical forecasting the different forecasts from different nodes/levels are blended, providing typically more accurate predictions and aligned decisions. In principle, if for every node we would produce the best practically possible forecasts, the blended coherent forecasts would contain all this information, as well as providing a common view of the future. The catch is that the whole hierarchy is tied to the same time bucket. If say the lowest level is at a daily or weekly sampling frequency, so is the top level. At the very aggregate level decision making is typically slower than at the very disaggregate operational level. This mismatch makes cross-sectional hierarchical forecasting useful for some aspects, but at the same time reduces it to a statistical exercise that hopes merely for forecast accuracy improvements.

3.3. Temporal hierarchies

Temporal hierarchies use the same machinery to solve the problem across time. I have covered this topic in more detail in previous posts, so I will be brief here.  Suppose we deal with a quarterly time series. This implies a hierarchical structure as in Figure 2.

Figure 2. A time series sampled in quarterly frequency forms an implicit temporal hierarchy, where an annum is split into two semi-annual periods, which are split into two quarters each.

Of course, we can define that hierarchy for monthly, daily, etc. time series. It should be quite evident in comparing Figures 1 and 2 that we can construct a summing matrix S for Figure 2 and produce coherent forecasts as needed. In this case, we achieve temporal coherency. That is, short-term lower level forecasts are aligned with long-term top-level forecasts. In practice, high-frequency short-term decision making is informed by long-term decision making and vice-versa. Temporal hierarchies offer substantial accuracy gains, due to seeing the time series from various aggregation viewpoints, hence capture both short- and long-term dynamics, but also mitigate the problem of model selection, as they naturally force the modeller to rely on multiple forecasts.

On the downside, although temporal hierarchies are a very handy statistical device for getting better quality forecasts, they do not always translate one-to-one with the relevant organisational decision making. For example, suppose that we model the daily demand of a particular SKU. Temporal hierarchies will be helpful in getting better forecasts. At the daily level and the levels close to it, decisions about stocking at stores, warehouses, etc. will be informed. As we get to the top levels, the forecasts may not relate directly to some decision. Do we need annual forecasts for a single SKU?

3.4. Cross-temporal hierarchies

The natural extension is to construct hierarchies that span both cross-sectional and temporal dimensions. Figure 3 illustrates how one could construct such a hierarchy. Each cross-sectional (blue) node, contains a temporal hierarchy (yellow). Here is where things start to become complicated. Expressing this hierarchy with a summing matrix S is not straightforward!

Figure 3. A cross-temporal hierarchy. Each cross-sectional node (blue), contains a temporal hierarchy (yellow).

With colleagues we have done some work in doing exactly that, only to realise that this needs a lot more thinking than just blindly adding columns and rows to S. For small hierarchies this may be feasible, but for large realistic ones, this becomes unmanageable very fast. This is work in progress, hopefully soon enough I will have something better to say on this!

Another approach is to do this sequentially. One could first do temporal and then cross-sectional, or the other way around. It appears that by first doing temporal, the forecasting exercise becomes simpler. However, the sequential application of the hierarchical machinery does not guarantee that forecasts will remain coherent across both dimensions. In fact, unless you have perfect forecasts it is easy to demonstrate that the second reconciliation will cause decoherence of the first. Figure 4 is helpful to understand this, but also to see the way forward.

Figure 4. Cross-sectional hierarchies on different temporal levels.

The cross-sectional hierarchy, from Figure 1, will remain applicable irrespective of the temporal level it is modelled at. Cross-sectional hierarchical forecasting merely chooses one of these levels and models everything there. Suppose we would model each of these cross-sectional hierarchies. The structure captured by S will stay the same, however G will most probably not, as it depends on the characteristics of the forecasts and the time series that change (i.e., the forecasting models/methods used and the resulting forecast errors), so at each temporal level it is reasonable to expect a different cross-sectional G. This is exactly what causes the problem. Not all G‘s can be true at the same time and ensure cross-temporal coherence.

The practical way forward is very simple, which improves forecast accuracy and imposes cross-temporal coherence:

  1. Produce temporal hierarchies forecasts for each time series of the cross-sectional hierarchy.
  2. Model the cross-sectional hierarchy at all temporal levels (these are reconciled temporally already).
  3. Collect all the different G‘s and calculate their element-wise average G*.
  4. Use the common G* to reconcile cross-sectionally, which by construction respects all temporal reconciliations.

The calculation is fairly trivial, although a large number of forecasts is required to be produced. Nowadays the latter is typically not an issue.

4. Does it work?

A recently published paper demonstrates the accuracy gains. Without going into too much detail, as one can find all that in the linked paper the key takeaways are:

  • The major forecast accuracy benefits come from temporal hierarchies.
  • Cross-sectional hierarchies still improve accuracy, but to a lesser extent.
  • The second hierarchical reconciliation is bound to offer fewer improvements, as the forecasts are already more accurate and therefore closer to the ideal ones that in principle are already coherent.
  • In our experiments, total gains were up to 10% accuracy. Obviously, this is dataset and setup dependent.

One may argue that this may be too much work for 10% accuracy. The strength of this argument depends on the quality of the base initial forecasts and also in the fact that the accuracy gains are a “by the way” benefit. The true purpose is to produce cross-sectionally coherent forecasts. These forecasts provide the same aligned view of the future across all dimensions, so they truly represent the one number forecast!

The forecasts are now aligned across:

  • Planning horizons: short/long, for high and low-frequency data (e.g., from hourly to yearly observations).
  • Planning units: SKUs to the most detailed level used, up to the total of the whole organisation.
  • The cross-temporal hierarchy respects the decision making needs: it provides detailed per SKU short-term high-frequency forecasts and aggregate long-term low-frequency forecasts. And these are coherent. No matter how you split or join forecasts together, they still agree.

The real benefit is that people can supplement different types of forecasts to the cross-temporal machinery. Back to the initial examples, the inventory side, the marketing side and the finance side keep on doing their work and provide their expert, model and information specific, views about the future. Crucially, this can be done in a modular fashion. An organisation does not have to go online with the whole construct simultaneously, but different functions can join step-by-step simply by revising the hierarchy to include that view.

In practice, one would use different type of models and inputs at the different parts of the cross-temporal hierarchy. For higher levels leading indicators, other regressors and expert judgment will be helpful. At lower levels, due to their size, univariate reliable forecasts, for example, based on exponential smoothing, potentially augmented by judgement, would be better suited.

5. The organisational benefits

An aligned view of the future across all levels/functions/horizons of an organisation comes with the apparent benefits for decision making. There are four more benefits that may not be apparent immediately:

  1. Break information silos the analytics way: it is not easy to change corporate structures, culture or human nature to improve communication between teams and functions. It is not easy to have colleagues who do not do forecasting for living to sit into long meetings about improving forecasts. The beauty of cross-temporal hierarchies is that forecasts can be produced independently and are subsequently weighted according to their quality. None of the views is discarded, but all are considered, with their different information base and distilled expert knowledge, to the single coherent forecast. Subsequently, information silos are softened as all functions and teams plan on a common blended view of the future.
  2. From strategising operations to informed strategies: the traditional managerial mantra is about how to operationalise strategies, i.e. how to take top-level decisions and vision about the future to the rest of the organisation. In principle that is all fine, if the top management had transparency of operations. A single page report, a line graph or a pie chart just doesn’t cut it! Cross-temporal hierarchies allow taking into consideration both top-down and bottom-up views, both short-term objectives/needs and long-term strategies/visions. This creates data transparency. Top management can generate a view of the future and then inform it with the rest of the organisational knowledge. Operations are closer to the customer. Marketing is shaping the customer. But neither operations or marketing have the bird-eye view of the board. And these are just examples.
  3. Ultra-fast decision making: welcome to a world where Artificial Intelligence decides for you. AI is not yet able to replace human decision making fully, but it is surely able to take care of many tedious decisions and do these at very large scales and very fast. It is only logical (obviously I had to use this phrase when talking about AI!) that we will see increasing use of AI to interact with customers at an increasingly high frequency. The scale can easily become of a level that it is impossible for human decision makers/planners/operators to supervise effectively. More importantly, if experts disengage from this, there is a good chance that the company will not be able to use all the expertise and experience in the (human) workforce. Cross-temporal hierarchies can help with that. AI will be able to take decisions and use data at ultra-fast frequencies. Humans do not need to follow that, as they can supplement with their views, knowledge and information with lower frequency decision making. Cross-temporal hierarchies will blend the two together, with AI adding additional levels to the hierarchical structure.
  4. Collaboration: thinking out of the box in a literal way. The cross-temporal machinery does not have to be restricted to a single organisation, but can encompass multiple. This way multiple units and stakeholders can share information and have a common view of the future. In the aforementioned paper, in the conclusions, we provide an example about the tourism sector, where hotel units, satellite companies and the state tourism board can all collaborate through a cross-temporal hierarchy.

Admittedly, each of the four points raised requires increasing analytics and corporate maturity. These are my views about how business will change, and my expectation is that this will happen rather quickly. Point 1 is apparent. Point 2 is necessary as employees become better skilled, better informed and better educated. If you want these people to remain part of your organisation, you can surely assume that top-down and traditional strategising operations will not be satisfactory. Point 3 is bound to happen, led by the large companies who already invest heavily in AI. But the interesting thing about AI is that its cost is reducing substantially and very fast, making it accessible to more and more organisations. Point 4 may be somewhat more contentious. What about competition between units and companies? My view is that collaborative existence is the only way forward for many small to medium size organisations, if they are to survive. How this is done, and what would be the involvement of larger players and the public is to be seen and surely a topic for a different discussion! This post is already too long!

Happy forecasting!

Cross-temporal coherent forecasts for Australian tourism

Nikolaos Kourentzes and George Athanasopoulos, 2019, Annals of Tourism Research, Vol 75, Pages 393-409.

Key to ensuring a successful tourism sector is timely policy making and detailed planning. National policy formulation and strategic planning requires long-term forecasts at an aggregate level, while regional operational decisions require short-term forecasts, relevant to local tourism operators. For aligned decisions at all levels, supporting forecasts must be `coherent’, that is they should add up appropriately, across relevant demarcations (e.g., geographical divisions or market segments) and also across time. We propose an approach for generating coherent forecasts across both cross-sections and planning horizons for Australia. This results in significant improvements in forecast accuracy with substantial decision making benefits. Coherent forecasts help break intra- and inter-organisational information and planning silos, in a data driven fashion, blending information from different sources.

Download paper.

Tutorial for the nnfor R package

The nnfor (development version here) package for R facilitates time series forecasting with Multilayer Perceptrons (MLP) and Extreme Learning Machines (ELM). Currently (version 0.9.6) it does not support deep learning, though the plan is to extend this to this direction in the near future. Currently, it relies on the neuralnet package for R, which provides all the machinery to train MLPs. The training of ELMs is written within the nnfor package. Note that since neuralnet cannot tap on GPU processing, large networks tend to be very slow to train. nnfor differs from existing neural network implementations for R in that it provides code to automatically design networks with reasonable forecasting performance, but also provide in-depth control to the experienced user. The automatic specification is designed with parsimony in mind. This increases the robustness of the resulting networks, but also helps reduce the training time.

Forecasting with MLPs

With the nnfor package you can either produce extrapolative (univariate) forecast, or include explanatory variables as well.

Univariate forecasting

The main function is mlp(), and at its simplest form you only need to input a time series to be modelled.

fit1 <- mlp(AirPassengers)
## MLP fit with 5 hidden nodes and 20 repetitions.
## Series modelled in differences: D1.
## Univariate lags: (1,2,3,4,5,6,7,8,10,12)
## Deterministic seasonal dummies included.
## Forecast combined using the median operator.
## MSE: 7.4939.

The output indicates that the resulting network has 5 hidden nodes, it was trained 20 times and the different forecasts were combined using the median operator. The mlp() function automatically generates ensembles of networks, the training of which starts with different random initial weights. Furthermore, it provides the inputs that were included in the network. This paper discusses the performance of different combination operators and finds that the median performs very well, the mode can achieve the best performance but needs somewhat larger ensembles and the arithmetic mean is probably best avoided! Another interesting finding in that paper is that bagging (i.e. training the network on bootstrapped series) or using multiple random training initialisations results in similar performance, and therefore it appears that for time series forecasting we can avoid the bootstrapping step, greatly simplifying the process. These findings are embedded in the default settings of nnfor.

You can get a visual summary by using the plot() function.


The grey input nodes are autoregressions, while the magenta ones are deterministic inputs (seasonality in this case). If any other regressors were included, they would be shown in light blue.

The mlp() function accepts several arguments to fine-tune the resulting network. The hd argument defines a fixed number of hidden nodes. If it is a single number, then the neurons are arranged in a single hidden node. If it is a vector, then these are arranged in multiple layers.

fit2 <- mlp(AirPassengers, hd = c(10,5))

We will see later on how to automatically select the number of nodes. In my experience (and evidence from the literature), conventional neural networks, forecasting single time series, do not benefit from multiple hidden layers. The forecasting problem is typically just not that complex!

The argument reps defines how many training repetitions are used. If you want to train a single network you can use reps=1, although there is overwhelming evidence that there is no benefit in doing so. The default reps=20 is a compromise between training speed and performance, but the more repetitions you can afford the better. They help not only in the performance of the model, but also in the stability of the results, when the network is retrained. How the different training repetitions are combined is controlled by the argument comb that accepts the options median, mean, and mode. The mean and median are apparent. The mode is calculated using the maximum of a kernel density estimate of the forecasts for each time period. This is detailed in the aforementioned paper and exemplified here.

The argument lags allows you to select the autoregressive lags considered by the network. If this is not provided then the network uses lag 1 to lag m, the seasonal period of the series. These are suggested lags and they may not stay in the final networks. You can force that using the argument keep, or turn off the automatic input selection altogether using the argument sel.lag=FALSE. Observe the differences in the following calls of mlp().

mlp(AirPassengers, lags=1:24)
## MLP fit with 5 hidden nodes and 20 repetitions.
## Series modelled in differences: D1.
## Univariate lags: (1,2,4,7,8,9,10,11,12,13,18,21,23,24)
## Deterministic seasonal dummies included.
## Forecast combined using the median operator.
## MSE: 5.6388.
mlp(AirPassengers, lags=1:24, keep=c(rep(TRUE,12), rep(FALSE,12)))
## MLP fit with 5 hidden nodes and 20 repetitions.
## Series modelled in differences: D1.
## Univariate lags: (1,2,3,4,5,6,7,8,9,10,11,12,13,18,21,23,24)
## Deterministic seasonal dummies included.
## Forecast combined using the median operator.
## MSE: 3.6993.
mlp(AirPassengers, lags=1:24, sel.lag=FALSE)
## MLP fit with 5 hidden nodes and 20 repetitions.
## Series modelled in differences: D1.
## Univariate lags: (1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24)
## Deterministic seasonal dummies included.
## Forecast combined using the median operator.
## MSE: 1.7347.

In the first case lags (1,2,4,7,8,9,10,11,12,13,18,21,23,24) are retained. In the second case all 1-12 are kept and the rest 13-24 are tested for inclusion. Note that the argument keep must be a logical with equal length to the input used in lags. In the last case, all lags are retained. The selection of the lags heavily relies on this and this papers, with evidence of its performance on high-frequency time series outlined here. An overview is provided in: Ord K., Fildes R., Kourentzes N. (2017) Principles of Business Forecasting 2e. Wessex Press Publishing Co., Chapter 10.

Note that if the selection algorithm decides that nothing should stay in the network, it will include lag 1 always and you will get a warning message: No inputs left in the network after pre-selection, forcing AR(1).

Neural networks are not great in modelling trends. You can find the arguments for this here. Therefore it is useful to remove the trend from a time series prior to modelling it. This is handled by the argument difforder. If difforder=0 no differencing is performed. For diff=1, level differences are performed. Similarly, if difforder=12 then 12th order differences are performed. If the time series is seasonal with seasonal period 12, this would then be seasonal differences. You can do both with difforder=c(1,12) or any other set of difference orders. If difforder=NULL then the code decides automatically. If there is a trend, first differences are used. The series is also tested for seasonality. If there is, then the Canova-Hansen test is used to identify whether this is deterministic or stochastic. If it is the latter, then seasonal differences are added as well.

Deterministic seasonality is better modelled using seasonal dummy variables. by default the inclusion of dummies is tested. This can be controlled by using the logical argument allow.det.season. The deterministic seasonality can be either a set of binary dummies, or a pair of sine-cosine (argument det.type), as outlined here. If the seasonal period is more than 12, then the trigonometric representation is recommended for parsimony.

The logical argument outplot provides a plot of the fit of network.

The number of hidden nodes can be either preset, using the argument hd or automatically specified, as defined with the argument By default this is"set" and uses the input provided in hd (default is 5). You can set this to"valid" to test using a validation sample (20% of the time series), or"cv" to use 5-fold cross-validation. The number of hidden nodes to evaluate is set by the argument hd.max.

fit3 <- mlp(AirPassengers,"valid",hd.max=8)
## MLP fit with 4 hidden nodes and 20 repetitions.
## Series modelled in differences: D1.
## Univariate lags: (1,2,3,4,5,6,7,8,10,12)
## Deterministic seasonal dummies included.
## Forecast combined using the median operator.
## MSE: 14.2508.

Given that training networks can be a time consuming business, you can reuse an already specified/trained network. In the following example, we reuse fit1 to a new time series.

x <- ts(sin(1:120*2*pi/12),frequency=12)
mlp(x, model=fit1)
## MLP fit with 5 hidden nodes and 20 repetitions.
## Series modelled in differences: D1.
## Univariate lags: (1,2,3,4,5,6,7,8,10,12)
## Deterministic seasonal dummies included.
## Forecast combined using the median operator.
## MSE: 0.0688.

This retains both the specification and the training from fit1. If you want to use only the specification, but retrain the network, then use the argument retrain=TRUE.

mlp(x, model=fit1, retrain=TRUE)
## MLP fit with 5 hidden nodes and 20 repetitions.
## Series modelled in differences: D1.
## Univariate lags: (1,2,3,4,5,6,7,8,10,12)
## Deterministic seasonal dummies included.
## Forecast combined using the median operator.
## MSE: 0.

Observe the difference in the in-sample MSE between the two settings.

Finally, you can pass arguments directly to the neuralnet() function that is used to train the networks by using the ellipsis ....

To produce forecasts, we use the function forecast(), which requires a trained network object and the forecast horizon h.

frc <- forecast(fit1,h=12)
##              Jan         Feb         Mar         Apr         May
## 1961 447.3392668 421.2532515 497.5052166 521.1640683 537.4031708
##              Jun         Jul         Aug         Sep         Oct
## 1961 619.5015908 707.8790407 681.0523280 602.8467629 529.0477736
##              Nov         Dec
## 1961 470.8292734 517.6762262

The plot of the forecasts provides in grey the forecasts of all the ensemble members. The output of forecast() is of class forecast and those familiar with the forecast package will find familiar elements there. To access the point forecasts use frc$mean. The frc$all.mean contains the forecasts of the individual ensemble members.

Using regressors

There are three arguments in the mlp() function that enable to use explanatory variables: xreg, xreg.lags and xreg.keep. The first is used to input additional regressors. These must be organised as an array and be at least as long as the in-sample time series, although it can be longer. I find it helpful to always provide length(y)+h. Let us suppose that we want to use a deterministic trend to forecast the time series. First, we construct the input and then model the series.

z <- 1:(length(AirPassengers)+24) # I add 24 extra observations for the forecasts
z <- cbind(z) # Convert it into a column-array
fit4 <- mlp(AirPassengers,xreg=z,xreg.lags=list(0),xreg.keep=list(TRUE),
            # Add a lag0 regressor and force it to stay in the model
            difforder=0) # Do not let mlp() to remove the stochastic trend
## MLP fit with 5 hidden nodes and 20 repetitions.
## Univariate lags: (1,4,5,8,9,10,11,12)
## 1 regressor included.
## - Regressor 1 lags: (0)
## Deterministic seasonal dummies included.
## Forecast combined using the median operator.
## MSE: 32.4993.

The output reflects the inclusion of the regressor. This is reflected in the plot of the network with a light blue input.


Observe that z is organised as an array. If this is a vector you will get an error. To include more lags, we expand the xreg.lags:

## MLP fit with 5 hidden nodes and 20 repetitions.
## Univariate lags: (1,4,5,8,9,10,11,12)
## Deterministic seasonal dummies included.
## Forecast combined using the median operator.
## MSE: 48.8853.

Observe that nothing was included in the network. We use the xreg.keep to force these in.

## MLP fit with 5 hidden nodes and 20 repetitions.
## Univariate lags: (1,4,5,8,9,10,11,12)
## 1 regressor included.
## - Regressor 1 lags: (1,2,3)
## Deterministic seasonal dummies included.
## Forecast combined using the median operator.
## MSE: 32.8439.

Clearly, the network does not like the deterministic trend! It only retains it, if we force it. Observe that both xreg.lags and xreg.keep are lists. Where each list element corresponds to a column in xreg. As an example, we will encode extreme residuals of fit1 as a single input (see this paper for a discussion on how networks can code multiple binary dummies in a single one). For this I will use the function residout() from the tsutils package.

if (!require("tsutils")){install.packages("tsutils")}
loc <- residout(AirPassengers - fit1$fitted, outplot=FALSE)$location
zz <- cbind(z, 0)
zz[loc,2] <- 1
fit5 <- mlp(AirPassengers,xreg=zz, xreg.lags=list(c(0:6),0),xreg.keep=list(rep(FALSE,7),TRUE))
## MLP fit with 5 hidden nodes and 20 repetitions.
## Series modelled in differences: D1.
## Univariate lags: (1,2,3,4,5,6,7,8,10,12)
## 1 regressor included.
## - Regressor 1 lags: (0)
## Deterministic seasonal dummies included.
## Forecast combined using the median operator.
## MSE: 7.2178.

Obviously, you can include as many regressors as you want.

To produce forecasts, we use the forecast() function, but now use the xreg input. The way to make this work is to input the regressors starting from the same observation that was used during the training of the network, expanded as need to cover the forecast horizon. You do not need to eliminate unused regressors. The network will take care of this.

frc.reg <- forecast(fit5,xreg=zz)

Forecasting with ELMs

To use Extreme Learning Machines (EMLs) you can use the function elm(). Many of the inputs are identical to mlp(). By default ELMs start with a very large hidden layer (100 nodes) that is pruned as needed.

fit6 <- elm(AirPassengers)
## ELM fit with 100 hidden nodes and 20 repetitions.
## Series modelled in differences: D1.
## Univariate lags: (1,2,3,4,5,6,7,8,10,12)
## Deterministic seasonal dummies included.
## Forecast combined using the median operator.
## Output weight estimation using: lasso.
## MSE: 91.288.

Observe that the plot of the network has some black and some grey lines. The latter are pruned. There are 20 networks fitted (controlled by the argument reps). Each network may have different final connections. You can inspect these by using plot(fit6,1), where the second argument defines which network to plot.

for (i in 1:4){plot(fit6,i)}

How the pruning is done is controlled by the argument.type The default option is to use LASSO regression (type=“lasso”). Alternatively, use can use “ridge” for ridge regression, “step” for stepwise OLS and “lm” to get the OLS solution with no pruning.

The other difference from the mlp() function is the barebone argument. When this is FALSE, then the ELMs are built based on the neuralnet package. If this is set to TRUE then a different internal implementation is used, which is helpful to speed up calculations when the number of inputs is substantial.

To forecast, use the forecast() function in the same way as before.

##              Jan         Feb         Mar         Apr         May
## 1961 449.0982276 423.0329121 455.7643540 488.5096713 499.3616506
##              Jun         Jul         Aug         Sep         Oct
## 1961 567.7486203 645.6309202 622.8311280 543.9549163 495.8305278
##              Nov         Dec
## 1961 429.9258197 468.7921570

Temporal hierarchies and nnfor

People who have followed my research will be familiar with Temporal Hierarchies that are implemented in the package thief. You can use both mlp() and elm() with thief() by using the functions mlp.thief() and elm.thief().

if (!require("thief")){install.packages("thief")}
thiefMLP <- thief(AirPassengers,forecastfunction=mlp.thief)
# Similarly elm.thief:
thiefELM <- thief(AirPassengers,forecastfunction=elm.thief)

This should get you going with time series forecasting with neural networks.

Happy forecasting!

R package: tsutils

The tsutils package for R includes functions that help with time series exploration and forecasting, that were previously included in the TStools package that is only available on github. The name change was necessary as there is another package on CRAN with the same name. The objective of TStools is to provide a development and testing ground for various functions developed by my group that has led to the creation of various packages over time, such as smooth, nnfor and now tsutils.

The package provides functions to support the following tasks:

  1. tests and visualisations that can help the modeller explore time series components and perform decomposition;
  2. modelling shortcuts, such as functions to construct lagmatrices and seasonal dummy variables of various forms;
  3. an implementation of the Theta method;
  4. tools to facilitate the design of the forecasting process, such as ABC-XYZ analyses;
  5. “quality of life” tools, such as treating time series for trailing and leading values.

We will look at some of these functions here, to show you how tsutils can be helpful in your forecasting projects.

Time series exploration

Exploring your time series is an important step in modelling them. Some standard questions that one has to consider is whether a time series exhibits trend and/or seasonality. First, we deal with the existence of trend. To look into this, it is useful to extranct the Centred Moving Average, and to do this you can use the functon cmav().

cma <- cmav(referrals) # referrals in a series included in the tsutils package.

The cmav() function offers a few useful arguments. By default it takes the length of the moving average from the time series information and sets it equal to its frequency. The user can override this by setting the argument ma. Another useful argument is fill=c(TRUE,FALSE) that instructs the function to fill the first and last observations of the centred moving average that are ommited in the calculation, due to the “centering”. This is done using exponential smoothing (ets() from the forecast package). This is done by default. The argument fast restricts the number of exponential smoothing models considered to speed up the calculation. Finallly the argument outplot=c(FALSE,TRUE) provides a plot of the time sereis and the calculated centred moving average.The CMA can help us identify whether there is a trend visually (see this paper for a discussion), but also to help us with statistical testing.

The function trendtest() provides a logical (TRUE/FALSE) answer for the presence of trend in a time series. It offers two alternatives, controlled by the argument type, which can take the values "aicc" or "cs" that tests by either fitting alternative exponential smoothing models and comparing their AICc or by using the Cox-Stuart test at 5% significance (you can run the test on its own using the function coxstuart()). The test can be conducted on the raw time series, or its CMA, the latter providing a cleaner signal of the low frequency components of the series, a major component of which is the trend. This is controlled by the argument extract. When this is TRUE the test is conducted on the CMA.

# Test on the CMA
trendtest(referrals, extract=TRUE)
## [1] TRUE
# Test on the raw time series
trendtest(referrals, extract=FALSE)
## [1] FALSE

We can see that testing on the unfiltered data provides a different response. A further (experimental) argument is mta that instructs the function to use Multiple Temporal Aggregation, as proposed in the paper that introduces MAPA. Some evidence of the effect of this is provided here, where it increases the accuracy of both approaches, AICc or Cox-Stuart testing.

The function seasplot() is handy for exploring the seasonality of the time series. This provides various arguments for controlling the result. You can manually set the seasonal length (ma – otherwise it uses frequency(y)), the starting period of the seasonal cycle (s – also picked up from the time series object, if available), assume trend being present, absent or test (trend), type of decomposition when trend is present (decomposition) and using precaclculated CMA for the decomposition (cma – use NULL for calculating internally). If there is trend in the time series (as set by the user, or tested) then the time series is first de-trended according to the decomposition setting and then the seasonal plot is produced.

There are a number of arguments to control the visualisation of seasonality. The argument outplot takes the following options:

  • 0, no plot;
  • 1, seasonal diagram;
  • 2, seasonal boxplots;
  • 3, seasonal subseries;
  • 4, seasonal distribution;
  • 5, seasonal density.

In addition, one can provide various arguments to the plot function (consider in the ... input), override the colouring (colour) and the labels (labels).

seasplot(referrals)## Results of statistical testing
## Evidence of trend: TRUE  (pval: 0)
## Evidence of seasonality: TRUE  (pval: 0.002)

The figure below exemplifies the available ploting alternatives.The function provides some useful results, organised in a list: the trend and seasonal compoonents, the logical result of testing for their presence and the p-value of the tests. The trend is tested using the coxstuart() function, while the seasonality is tested by applying the Friedman test on the data as organised for `outplot=2’, i.e. is there evidence that at least one of the ranks of the distributions of the seasonal indices is significantly different from the rest?

Finally, the function decomp() provides classical time series decomposition. Many of the arguments are similar to the seasplot function. Note that type now defines the method that is used to calculate the seasonal indices, which can be the arithmetic mean, median or a “pure seasonal” exponential smoothing model. That is, simply applying an exponential smoothing on the seasonal indices to smooth any randomness. As the later makes it complicated to extrapolate the seasonal component, the function accepts the argument h to provide predicted seasonal indices for h-steps ahead. Additionally for the calculation of the arithmetic mean one can winsorise extremes using the argument w.

dc <- decomp(referrals,outplot=TRUE)

## [1] "trend"     "season"    "irregular" "f.season"  "g"

You can see that the output of the function is a list containing the various time series components. Note that decomp() does not test for the presence of seasonality will always extract a seasonal component.

Time series modelling

There are a few fuctions that are designed with regression modelling in mind. The function lagmatrix() takes an input variable and constructs leads and lags as required.

lags <- lagmatrix(referrals, c(0,1,3:5))
##         [,1]   [,2]   [,3]   [,4]   [,5]
##  [1,] 933976     NA     NA     NA     NA
##  [2,] 890852 933976     NA     NA     NA
##  [3,] 912027 890852     NA     NA     NA
##  [4,] 955852 912027 933976     NA     NA
##  [5,] 821916 955852 890852 933976     NA
##  [6,] 928176 821916 912027 890852 933976
##  [7,] 979219 928176 955852 912027 890852
##  [8,] 853812 979219 821916 955852 912027
##  [9,] 792731 853812 928176 821916 955852
## [10,] 913787 792731 979219 928176 821916

The first column is the original variable (lag 0), while the other columns are the requested lagged realisations. Leads are constructed by using negative numbers for the argument lag.

The function seasdummy() creates seasonal dummy variables that can be either binary or trigonometric, as defined by the argument type=c("bin","trg"). Other arguments are the number of observations to create (n), the seasonal periodicity (m), a time series (y) that is an optional argument and can be useful to get the seasonal periodicity and the starting period information. The latter is useful when the seasonality is trigonometric to get the phase of the sines and cosines correctly. Argument full retains or drops the m-th dummy variable. For linear models that would typically be set to full=FALSE. For nonlinear models this becomes a somewhat more interesting modelling question. Some evidence for neural networks is provided in this paper.

##       [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11]
##  [1,]    1    0    0    0    0    0    0    0    0     0     0
##  [2,]    0    1    0    0    0    0    0    0    0     0     0
##  [3,]    0    0    1    0    0    0    0    0    0     0     0
##  [4,]    0    0    0    1    0    0    0    0    0     0     0
##  [5,]    0    0    0    0    1    0    0    0    0     0     0
##  [6,]    0    0    0    0    0    1    0    0    0     0     0
##  [7,]    0    0    0    0    0    0    1    0    0     0     0
##  [8,]    0    0    0    0    0    0    0    1    0     0     0
##  [9,]    0    0    0    0    0    0    0    0    1     0     0
## [10,]    0    0    0    0    0    0    0    0    0     1     0

For example, one can use these two functions to quick construct dynamic regression models.

Xl <- lagmatrix(referrals,c(0:2))
Xd <- seasdummy(n=length(referrals),y=referrals)
X <-,Xd))
colnames(X) <- c("y",paste0("lag",1:2),paste0("d",1:11))
fit <- lm(y~.,X)
## Call:
## lm(formula = y ~ ., data = X)
## Residuals:
##    Min     1Q Median     3Q    Max 
## -99439 -43116   7114  38919  98649 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 4.808e+05  1.695e+05   2.836  0.00672 ** 
## lag1        1.469e-01  1.337e-01   1.098  0.27760    
## lag2        1.799e-01  1.397e-01   1.288  0.20394    
## d1          1.751e+05  3.820e+04   4.584 3.38e-05 ***
## d2          1.867e+05  4.161e+04   4.488 4.64e-05 ***
## d3          2.445e+05  3.485e+04   7.015 7.75e-09 ***
## d4          1.285e+05  4.204e+04   3.056  0.00369 ** 
## d5          1.840e+05  3.892e+04   4.727 2.10e-05 ***
## d6          2.020e+05  3.709e+04   5.447 1.83e-06 ***
## d7          2.054e+05  3.621e+04   5.673 8.39e-07 ***
## d8          1.139e+05  3.613e+04   3.152  0.00282 ** 
## d9          1.698e+05  3.604e+04   4.713 2.21e-05 ***
## d10         2.168e+05  3.749e+04   5.783 5.73e-07 ***
## d11         1.616e+05  3.658e+04   4.418 5.83e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 56250 on 47 degrees of freedom
##   (2 observations deleted due to missingness)
## Multiple R-squared:  0.5882, Adjusted R-squared:  0.4743 
## F-statistic: 5.165 on 13 and 47 DF,  p-value: 1.365e-05

Obviously, one might prefer to have the seasonal dummies as a factor variable, when these are to be coded as binary dummy variables. Users of the nnfor package may recognise these two functions. In the next release of that package, both functions will be moved to the tsutils (this is already done in the development version).

The function residout() looks at model residuals from a control chart point of view and identifies outlying ones. You can manually set the threshold over which the standardised residuals will be considered outlier (t) and whether to produce a plot (outplot).

rsd <- residout(residuals(fit))

If there are outliers, their location and values are provided in the outputted list.

err <- rnorm(20)
err[5] <- err[5]+5
## [1] 5

The idea behind this function is to identify outliers so that the modeller can construct dummies to account for them or manipulate them otherwise.

A somewhat more exotic function is getOptK that provides the optimal temporal aggregation level for AR(1), MA(1) and ARMA(1,1) processes. The user defines the type of data generating process that we expect (type=c("ar","ma","arma")) and the maximum aggregation level (m). The output is the theoretically optimal aggregation level.

getOptK(referrals) # This time series is clearly not an AR(1)!
## [1] 9

For a discussion of the pros and cons of this theoretical calculation see this paper. My personal view is that Multiple Temporal Aggregation as implemented in MAPA or Temporal Hierarchies (thief) is a less risky modelling approach and therefore more desirable for practice.

The function theta() provides an implementation of the classical Theta method, with a few modifications. The workflow matches the one in the forecast package:

fit <- theta(referrals)
frc <- forecast.theta(fit,h=20)

This implementation of the Theta method tests for seasonality and trend. Seasonal decomposition can be done either additively or multiplicatively and the seasonality is treated as a pure seasonal model. The various Theta components can be optimised using different cost functions. The originally proposed Theta method always assumed multiplicative seasonality and presence of trend, while all theta lines were optimised using MSE and seasonality was estimated using classical decomposition and arithmetic mean. The various arguments of the function can control how theta() operates: m can be used to manually set the seasonal length, sign.level the significance level for the statistical tests, cost0 to cost2 the cost function for the various components of the model, multiplicative the type of decomposition used and cma an optional pre-calculated Centred Moving Average used for decomposition and testing. Also note that since impolementation is not based on a model, it does not provide prediction intervals or information criteria.

If you are a user of thief (if not give it a try!) the function theta.thief() integrates both packages:

##            Jan       Feb       Mar       Apr       May       Jun       Jul
## 2013                                                             1035418.8
## 2014  955320.7  957015.3 1067922.5  984519.0 1019637.9 1027866.9 1040235.2
## 2015  959464.4  961167.2 1072865.8  988823.4 1024207.9 1032495.2          
##            Aug       Sep       Oct       Nov       Dec
## 2013  943204.9  991756.3 1031436.6  987854.5  819194.1
## 2014  947299.6  996207.8 1036200.5  992380.8  822408.6
## 2015

In this case temporal hierarchy forecasts are constructed using the Theta method, as implmemented in tsutils, at all temporal aggregation levels.

One of the major tasks in producing the appropriate forecasts for a time series is evaluating the performance of different models/methods. The nemenyi() function is helpful to perform nonparametric multiple comparisons testing. There are two considerations. First, some difference in summary performance statistics does not always mean that one forecast is indeed preferable, as differences may still be due to randomeness, and more evidence is necessary. Second, doing repeated pairwise testing between methods does not result in the desired significance in the results, as by doing this repetition is distorting your testing probabilities. The nemenyi() function first performs a Friedman test to establish that at least one method performs significantly different and then uses the post-hoc Nemenyi test to group methods by similar performance. Both tests are nonparametric, and therefore the distribution of the performance metric is not an issue.

x <- matrix( rnorm(50*4,mean=0,sd=1), 50, 4) # Generate some performance statistics
x[,2] <- x[,2]+1 
x[,3] <- x[,3]+0.7
x[,4] <- x[,4]+0.5
colnames(x) <- c("Method A","Method B","Method C - long name","Method D")
## Friedman and Nemenyi Tests
## The confidence level is 5%
## Number of observations is 50 and number of methods is 4
## Friedman test p-value: 2e-04 - Ha: Different
## Critical distance: 0.6633

The plot is fairly easy to read. Any for methods that are joined by a horizontal there is no evidence of statistically significant differences. There are multiple lines, as we can start measuring from different methods. The test is simple. For any method that is within plus/minus the critical distance (CD in the plot) from the mean rank (the values next to method names) there is no evidence of differences. The function provides different plot styles, using the argument plottype. This can be used to minic the “multiple comparison with the best” (plottype="mcb" or plottype="vmcb" for vertical alternative), the “line plot” above (plottype="line" or plottype="vline"), a detailed comparison (plottype="matrix") or output no visualisation (plottype="none").

The mcb visualisation shows the performance of the various methods seperately. The dot in the middle is the mean rank of the method and the lines span te critical distance (half above and half below). For any other method that its boundaries are crossed there is no evidence of significant differences. The matrix shows on the vertical axis all methods sorted according to mean rank. On the horizontal axis they are sorted as in the input array. The balck cell is the method from which we measure and in a row (or column) any methods in blue belong to the same group of perofrmance. For example, in all the plots above, method A has no differences from d and c. Method d similarly is grouped together with a and c. When we measure from c, on the other hand, we have no evidence for significant differences from any method. When we measure from b, then both a and d have significant differences. If we are interested in the best performing group then the mcb visualisation is the simplest to consult. Otherwise matrix is the most detailed and line/vline is a compromise between the two.

Forecasting process modelling

Beyond various modelling functions, the package tsutils offers some functions for analysing the forecasting process. More specifically it helps to conduct ABC-XYZ analyses. This is done using the functions abc(), xyz() and abcxyz().

Starting with the ABC part of the analysis, given an array of time series, where each column is a time series, you can use:

x <- abs(matrix(cumsum(rnorm(5400,0,1)),36,150)) # Construct some random data
ABC <- abc(x)
## ABC analysis
##          Importance %
## A (20%): 39.678
## B (30%): 43.061
## C (50%): 17.261

We can also plot the output, providing the concentration in each class.


Alternatively, the input can be a vector that contains the importance index for each series. The reader is referred to Ord K., Fildes R., Kourentzes N. (2017) Principles of Business Forecasting, 2e. Wessex Press Publishing Co., p.515-518 for details of how to use the ABC analysis, as well as the XYZ analysis that follows.

The user can set custom percentages for each class, or change the number of classes with the argument prc. For example:


The function xyz() looks at the forecastability side. In addition to the arguments in abc() the user can select the type of forecast used to generate errors (type) and the seasonal periodicity of the series (m). For a discussion see the aforementioned reference and here. By default the naive and seasonal naive methods are used (type="naive"), but the user can opt to rely on exponential smoothing (type="ets"). The option to use the coefficient of variation is available (type="cv"), but I strongly advise against using it (see the previous references).

XYZ <- xyz(x,m=12)

One can combine both analyses with the function abcxyz(). This can also accept a vector of forecast errors (with the argument error) that are distributed to each class of items. This vector must be equal length to the columns used to construct the ABC and XYZ analyses.


## $class
##    Z  Y  X
## A  0  0 30
## B  0  0 45
## C 30 45  0
## $error

Quality of life functions

The package includes some QoL functions. The function geomean() calculates the geometric mean and accepts the same arguments as the mean().

## [1] 0.9085603
## [1] 1

The function leadtrail() can remove leading or trailing zeros or NAs as requested. This is controlled by the argumentsrm, lead and trail.

x <- c(rep(0,2),rnorm(5),rep(0,2))
## [1]  0.0000000  0.0000000 -0.9519769 -1.1978381 -0.9596107 -1.6142629
## [7]  0.5855328  0.0000000  0.0000000
## [1] -0.9519769 -1.1978381 -0.9596107 -1.6142629  0.5855328

The function wins() winsorises vectors and accepts the argument p to control how many observations are treated. If p is less than 1, then it is interpretted as a percentage. If it is bigger or equal to 1, then it is treated as number of observations.

x <- sort(round(runif(10)*10))
##  [1] 0 1 1 1 3 7 8 8 9 9
##  [1] 1 1 1 1 3 7 8 8 8 8
##  [1] 1 1 1 1 3 7 8 8 9 9

There are two vectorised versions colWins() and rowWins() to speed up calculations for large arrays.

Finally, the function lambdaseq() is meant to be used with LASSO and similar. Given some response target y and regressors x it calculates a sequence of lambdas from the OLS solution to the penalty that removes all regressors from the model. If weighted LASSO will be used, then the argument weigth can beused to account for that. To switch to ridge regression or elastic nets use the argument alpha. How many lambdas are produced (nLambda) and how the lambda decreases (lambdaRatio) can be controlled. The argument standardise preprocess the variables, while the arguement addZeroLamda does as it says and adds the OLS solution lambda in the output vector.

y <- mtcars[,1]
x <- as.matrix(mtcars[,2:11])
lambda <- lambdaseq(x, y)[[1]]

This can be directly used with the glmnet package, or with custom shrinkage estimators (the motivation behind building this function!).

fit.lasso <- cv.glmnet(x, y, lambda = lambda)

Users of TStools will not find any new functionality in tsutils yet. Under the hood there are various improvements and better documentation. Where necessary, the functions point to useful references and examples. Most importantly, these functions are now in CRAN and safe to use! The plan is to keep expanding tsutils with more functions that graduate from the development TStools.

Happy forecasting!

Incorporating Leading Indicators into your Sales Forecasts

Nikolaos Kourentzes and Yves Sagaert, Foresight: The International Journal of Applied Forecasting, 2018, Issue 48.

This is a modified version of the paper that appears in Foresight issue 48. This provides a simplified version of the modelling methodology described in this paper and applied here and here.


Using leading indicators for business forecasting has been relatively rare, partly because our traditional time series methods do not readily allow incorporation of external variables. Nowadays however, we have an abundance of potentially useful indicators, and there is evidence that utilizing relevant ones in a forecasting model can significantly improve forecast accuracy and transparency.

What are leading indicators?

We can define a leading indicator as a numerical variable that contains predictive information for our target variable (e.g., sales) at least as many periods in advance as the forecast lead time. There are several aspects to this definition.

  • First, the indicator should be a hard variable; that is, recorded in a useful way for inputting into a statistical model. This means it should have adequate history and be available at the required frequency, after any appropriate transformations (for example, aggregating data into the desired frequency).
  • Second, the indicator must contain genuine predictive information, not spurious statistical correlation with the target.
  • Third, the indicator must lead the target variable by enough time to be operationally useful. If for example we have a leading indicator that is very informative one month in advance of sales, but we need three-month ahead sales forecast for ordering supplies and materials, this indicator lacks sufficient lead time. (If you are still thinking that it is useful for improving your forecasts, then your forecast lead time is not three but one month – a misjudgment we see all too often!) Note that sometimes it is possible to produce (or source externally) forecasts of indicators, but then poor forecasts here indicators will be damaging! A detailed discussion of the so-called conditional (i.e. using information only up to the time of forecast) and uncoditional forecasting can be found in Principles of Business Forecasting, where alternative forecasting strategies are discussed.

Sources of potential indicators are governmental, banking and private-sector data on macroeconomic, internet search data, social media postings, and an often overlooked source,  the company itself. Strategic, marketing on other company plans, can be useful for constructing leading indicators.

Forecasting with leading indicators

Suppose we have identified a useful leading indicator, how do we use it for forecasting? In the simplest setting, where the target responds linearly to movement in the indicator, we can construct a forecast as Equation (1):

F_{t+h} = M_{t+h} + cX_t (1)

where F_{t+h} is the forecast for period t+h, constructed at period t (now!), M_{t+h} represents the time series components – level, trend, and seasonality – of the target at period t+h exclusive of the effect of the leading indicator, and c is the coefficient on a leading indicator X_t.  In a regression context, M_{t+h} could comprise a constant, a trend variable, a set of seasonal dummies, or lagged values of the target.

Observe that we use index t for X_t, since we require its value to be available at the time we construct the forecast for at least h periods ahead. If the lead were more than h, we would require a forecast for the indicator X. which may introduce more forecast error. Additionally. we might be able to procure forecasts of X from an external source.

Equation (1) can be extended to include many indicators (or lags of the same indicator) in the same fashion as we would include more variables in a multiple regression.

Identifying leading indicators: Lasso regression

So far, we have assumed that we know a priori which indicators are useful. Realistically, this is not the case – so we need an identification strategy for selecting among potentially numerous potential indicators.

Let’s start by restricting our search to macroeconomic leading indicators, which can readily be sourced from a national statistics bureau. There are thousands, maybe tens of thousands of these that could have predictive information. On top of this, we need to consider various lags of the indicators to identify the most informative lead. For example, is it housing sold three or four months in advance that is most informative for our sales? This can quickly bring the number of potential indicators to several hundreds of thousands.

In fact, the number of potential indicator variables k can be larger than the number of available historical observations in our sample n, which makes it impossible to estimate or evaluate a model and rules out classical regression search strategies, such as stepwise regression.

The lasso regression procedure

A promising solution to this problem comes from the technique called Least Absolute Shrinkage and Selection Operator (Lasso) regression. Lasso was introduced by Robert Tibshirani (1996) and since has become a popular tool in many disciplines. (See Ord et al., 2017, section 9.5.2 for a barebone introduction and Hastie et al., 2015 for a thorough discussion) Lasso regression has the major advantage that it can still select and estimate a model when the number of potential indicator variables is greater than the number of historical observations, k>n.

To estimate the model coefficients in standard regression we minimise the Sum of Squared Error (SSE):

SSE=\sum_{t+1}^{n}{(A_t-F_t)^2} (2)

where A_t is the actual observation at time t and F_t is the forecast.  Suppose that the forecast is made up by a constant and k explanatory variables X_{k,t}. For the sake of simplicity, we will drop any lags from the notation, and consider any lagged input as different inputs X. Given that F_t=c_o+\sum_{i=1}^k{c_iX_{i,t}}, (2) becomes:

SSE=\sum_{t=1}^n{(A_t-c_0-\sum_{i=1}^k{c_iX_{i,t}})^2}, (3)

where c_o is the constant and c_i the coefficient on indicator X_{i,t}.

Equation (3) is called the cost function or loss function of the regression. The regression method finds the values for c_o and c_i that minimize the cost function. When a coefficient is insignificantly different from zero, we typically remove that variable from the regression, achieving variable selection.

If k>n, there is no unique solution for the c coefficients, which renders the model useless. This is because we allow full flexibility to the coefficients, which can take on any values. To overcome the problem, we need to impose restrictions on the coefficients. Lasso does so by modifying Equation (3) so that it includes a penalty for model complexity, as measured by the number of regressors in the model with non-zero coefficients. If we think of Equation (3) as “fitting error” we can conceptually write

Error = (Fitting Error) + λ(Penalty). (4)

The reader may realise that Equation (4) appears quite often in time series modelling. Information criteria, such as Akaike’s Information Criterion (AIC) have a similar structure, although a very different underlying logic and derivation. Information criteria seek to mitigate potential overfitting of a model, by imposing penalties/restrictions to the model fit.

In Lasso, penalty is \sum_{i=1}^{k}{|c'_i|}, where c'_i is the coefficient of the standardised variable X_i. The intuition behind this penalty is the following: the more non-zero coefficients (i.e. included variables) the bigger the penalty becomes. Therefore, for equation (4) to be minimised, the penalty must become small as well, which forces variables out of the model. Standardization puts all variables on the same scale, preventing larger-valued variables from dominating just because of their scale rather than their importance in explaining the variation of the target variable.

We can now write the cost function of Lasso as:

 \sum_{t=1}^{n}{(A'_t - \sum_{i=1}^{k}{c'_i X'_{i,t}})^2} + \lambda \sum_{i=1}^{k}{|c'_i|} , (5)

where both actuals and input variables have been standardised, denoted by the prime indicator (‘).

Determining λ

Lasso attempts to minimise Equation (5). If \lambda = 0 Equation (5) reduces to Equation (3), the conventional regression solution. As λ increases above zero, the penalty becomes increasingly dominant, which pushes the c coefficients toward zero (hence the name “shrinkage”). Those variables with zero coefficients will be removed from the model. Intuitively, it is the coefficients on the least important variables that are first pushed to zero, leaving the more important variables in the model to be estimated.  (“Important” does not mean causal, but merely that it explains some of the variance of the historical data.)

So Lasso will select variables— those that remain— and estimate the values of their coefficients. This is true even when k>n.

A “side-effect” of Lasso regression is that we no longer receive statistical-significance values, as we do with conventional regression.  This is so because Equation (5) violates important assumptions underlying the regression model.

Figure 1 illustrates how the coefficients of ten variables – each distinguished by number and color –shrink as  increases (from left to right), until eventually all are removed from the model.

The coefficient values on the very left of the plot are almost identical to the standard regression solution (λ approximatelly zero). This changes as λ increases. For example, imagine a vertical line at λ=0.05. It’s intersection with the lines in the plot reveal the coefficients’ values (on the vertical axis). Some are already equal to zero: those numbered Similarly, if we look at the coefficients for λ=0.15, only one coefficient (number 9) remains non-zero.

Figure 1: Evolution of coefficients for different values of λ. The horizontal lower axis provides the value of λ, where on the left side it starts with λ≈0 and ends with a sufficiently large value to remove all variables. The vertical axis indicates the value of the coefficients. Each variable coefficient is assigned a different colour and are named from 1 to 10.

So the choice of value for λ is very important for the performance of Lasso. But how is λ determined?  The standard procedure is through cross-validation: we separate the historical sample into various subsamples, calculating the cross-validated mean squared error across all subsamples. We then look for λ values that minimises this error. Actually, we often chose a slightly larger λ to avoid overfitting to the cross-validation itself. Figure 2 visualises the result of this process. Here, we ran Lasso regressions for various values of λ – plotted on the horizontal axis – and recorded the cross-validated errors. The right-most vertical line indicates the selected (larger) λ, while the vertical line to its left indicates the minimum cross-validated error.

With λ determined, Lasso proceeds to estimate the coefficients of the remaining variables and at the same time drop any variables accordingly.

Figure 2: Cross-validated Mean Squared Error (MSE; red dots) for various values of λ with standard errors (vertical bars) of MSE. Upper horizontal axis indicates the number of non-zero variables. Lower horizontal axis increases at a logarithmic scale.

A case study

Sagaert et al. (2018) showed how Lasso was applied to identify macroeconomic leading indicators for forecasting a tire manufacturer’s sales at a regional level. More specifically, our focus was on forecasting monthly sales for the raw materials in two types of end products for the regions of EU and the US.

The current company practice uses Holt-Winters exponential smoothing to forecast long term sales based on estimates of the trend and seasonal patterns. Management believed however that there are key economic activity indicators that could yield predictive information. For instance, net transported goods can be a good indicator for tire sales. As more goods are transported, more trucks are used, leading eventually to a higher demand for tires. Obviously, there is a lag between the ramp up of the number of transported goods and need for tires. Therefore, the transported good variable could be an useful leading indicator.

But there are possibly numerous other leading indicators for this company. Using publicly available macroeconomic indicators for multiple countries, from the St. Louis Federal Reserve Economic Data, we collected a set of 67,851 potential indicators. These covered a wide variety of categories: consumption, feedstock, financial, housing, import/export, industrial, labour and transportation.

For each indicator, we considered lags from 1 to 12 months, increasing the number of potential indicators to 814,212! At this point we could choose either to build a fully automatic forecasting model, where the selection of leading indicators is based solely on the selection made by the Lasso, or take advantage of managerial knowledge as well.

We interviewed the management team with the aim of identifying relevant subgroups of indicators. This exercise led to a subgroup of 1,082 indicators from the original 67,851. As the objective was to produce forecasts for the next 12 months, a different number of indicators was usable for forecasting 1-step ahead to 12-steps ahead. For one step ahead all 12,984 (all 12 lags) were used. In contrast for forecasting 12-steps ahead, only 1,082 indicators were available (only those with a lag of 12 months, as shorter lagging values would need to be forecasted). Note that we also tried a fully statistical model, built using all indicators with no managerial information. That model performed slightly worse that the one enhanced by expert knowledge and therefore is not discussed hereafter.

Figure 3: For each forecast horizon, a forecasting model is built. The number of indicators in the input set (bars) and the average number of selected indicators (line) are shown for the 12 different Lasso models, one for each forecast horizon.


We built 12 different Lasso regressions —one for each forecast horizon— using a different number of leading indicators, as shown in Figure 3. These regressions were further enhanced by including seasonal effects and potential lags of sales, all of which were selected automatically by the Lasso regression. The final forecasting models included only a selection of the available input variables.

Figure 3 illustrates, for each forecasting horizon, the number of available indicators and the average number of indicators selected. The most informative indicators selected turned out to be: employment in automobile dealerships, the number of national passenger car registrations and the consumer price index for solid fuel prices, all logical choices for forecasting tire sales.

We constructed the Lasso forecasts using the excellent glmnet package for R, whose functions allow us to quickly build Lasso models, to select the optimal value of λ and to generate the forecasts.

Figure 4 provides the overall Mean Absolute Percentage Error (MAPE) across four time series and up to 12 horizons for tire sales. Company is the current company forecast based on Holt-Winters exponential smoothing. ETS is a more general exponential smoothing approach that considers all versions of level, trend and/or seasonal forms of exponential smoothing. The appropriate form is chosen automatically using the information criterion,AIC.  Lasso is our Lasso regression based on the managerial-selected subset of indicators.

Figure 4: Overall Mean Absolute Percentage Error (MAPE%) for the alternative forecasting methods across the forecast horizon.

The MAPEs are 18.6% for Company, 15.3% for ETS and 15.1% for Lasso.

  • The Company forecast is easily outperformed by ETS, with the exception of very short horizons (1 to 3-steps ahead). So, even without consideration of leading indicators, the company could have improved its forecasting performance by using the exponential smoothing family of models (ETS) that embeds the advancements made in the last 15 years in terms of model forms, parameter estimation, and model selection.
  • Over short to mid-term horizons (up to 6-steps ahead) Lasso offered substantial gains over ETS. At longer horizons Lasso remains competitive to ETS up to 10-steps ahead but fell short at still longer horizons. For short-term forecasts we allowed Lasso to look for short and long leading effects. On the other hand, for long-term forecasts Lasso could look only for long-lead effects, which will be fewer and weaker. For this particular case there was very limited predictive information for leading indicators with leads of 11 or 12 months. At that forecast lead time, information on trend and seasonality was the most important, which was captured effectively by exponential smoothing (ETS).
  • Overall, Lasso proved to be the most accurate forecast method, although the accuracy gain was small compared to ETS. This has significant implications for forecasting practice: ETS contains no leading indicators and can be implemented automatically whereas Lasso considers volumes of data at the planning level.

This approach was also found to translate to inventory benefits. You can find more details here.

Is the use of leading indicators the way forward?

Our study shows that while there is merit in including leading indicators via a Lasso modelling framework for business forecasting, especially with managerial guidance, the practical usefulness of leading indicators is limited to shorter horizons. At very long horizons chances are that there is only univariate information (trend and seasonality) left to model. Appropriately used extrapolative forecasting models remain very important, even for tactical level forecasting.

We certainly now have the computational power to make the use of leading indicators. But at what level of aggregation should they be factored in. In this case study, we modelled aggregate sales series for which macroeconomic effects were expected to be important. At a disaggregate level, how reasonable is it to follow our approach?

Our team has not found evidence that it is worthwhile to build such complex models at the disaggregate level and that univariate forecasts are just as good. This is intuitive, as any macro-effects are lost in the noisy sales of the disaggregate level (for example per SKU sales). Nonetheless, that does not preclude constructing hierarchical forecasts, where at the top levels, macroeconomic indicators can add value, enhancing disaggregate level ETS forecasts.

We used Lasso regression here as our core modelling. Its substantial advantage over other modelling approaches that attempt to condense many potential variables into fewer construct (principal components analysis, or more generally dynamic factor models) is that Lasso outputs the impact of each selected variable and is transparent to the users. It therefore offers greater business insights for managerial adjustments to the forecasts and the eventual acceptance of the forecasts in the organisation.

In other experiments, our team used leading indicators sourced from online search habits of consumers and found that, even if there is a connection, this connection does not manifest itself in the required forecast horizons (Schaer et al., 2018), limiting their value of online indicators for sales forecasting.

So, while there are gains to be made from using leading indicators, we should not be tempted to overly rely on them, when simpler univariate forecasts can do nearly as well. . On the other hand, leading indicators may be able to follow these dynamics and provide crucial forecast accuracy gains. At the end of the day, a model enriched with leading indicators has to make sense!

Incorporating macroeconomic leading indicators in tactical capacity planning

Yves R. Sagaert, Nikolaos Kourentzes, Stijn Du Vuyst, El-Houssaine Aghezzaf and Bram Desmet, 2018, International Journal of Production Economics.

Tactical capacity planning relies on future estimates of demand for the mid- to long-term. On these forecast horizons there is increased uncertainty that the analysts face. To this purpose, we incorporate macroeconomic variables into microeconomic demand forecasting. Forecast accuracy metrics, which are typically used to assess improvements in predictions, are proxies of the real decision associated costs. However, measuring the direct impact on decisions is preferable. In this paper, we examine the capacity planning decision at plant level of a manufacturer. Through an inventory simulation setup, we evaluate the gains of incorporating external macroeconomic information in the forecasts, directly, in terms of achieving target service levels and inventory performance. Furthermore, we provide an approach to indicate capacity alerts, which can serve as input for global capacity pooling decisions. Our work has two main contributions. First, we demonstrate the added value of leading indicator information in forecasting models, when evaluated directly on capacity planning. Second, we provide additional evidence that traditional metrics of forecast accuracy exhibit weak connection with the real decision costs, in particular for capacity planning. We propose a more realistic assessment of the forecast quality by evaluating both the first and second moment of the forecast distribution. We discuss implications for practice, in particular given the typical over-reliance on forecast accuracy metrics for choosing the appropriate forecasting model.

Download paper.

Judgmental Selection of Forecasting Models

Fotios Petropoulos, Nikolaos Kourentzes, Konstantinos Nikolopoulos and Enno Siemsen, 2018, Journal of Operations Management.

In this paper, we explored how judgment can be used to improve the selection of a forecasting model. We compared the performance of judgmental model selection against a standard algorithm based on information criteria. We also examined the efficacy of a judgmental model-build approach, in which experts were asked to decide on the existence of the structural components (trend and seasonality) of the time series instead of directly selecting a model from a choice set. Our behavioral study used data from almost 700 participants, including forecasting practitioners. The results from our experiment suggest that selecting models judgmentally results in performance that is on par, if not better, to that of algorithmic selection. Further, judgmental model selection helps to avoid the worst models more frequently compared to algorithmic selection. Finally, a simple combination of the statistical and judgmental selections and judgmental aggregation significantly outperform both statistical and judgmental selections.

Download paper.

Quantile forecast optimal combination to enhance safety stock estimation

Juan R. Trapero, Manuel Cardos and Nikolaos Kourentzes, 2018, International Journal of Forecasting.

The safety stock calculation requires a measure of the forecast error uncertainty. Such errors are usually assumed Gaussian iid (independent, identically distributed). However, deviations from iid deteriorate the supply chain performance. Recent research has shown that, alternatively to theoretical approaches, empirical techniques that do not rely on the aforementioned assumptions, can enhance the safety stock calculation. Particularly, GARCH models cope with time-varying heterocedastic forecast error, and Kernel Density Estimation do not need to rely on a determined distribution. However, if forecast errors are both time-varying heterocedastic and do not follow a determined distribution, the previous approaches are inadequate. To overcome this, we propose an optimal combination of the empirical methods that minimizes the asymmetric piecewise linear loss function, also known as tick loss. The results show that combining quantile forecasts yields safety stocks with a lower cost. The methodology is illustrated with simulations and real data experiments for different lead times.

Download paper.

Another look at forecast selection and combination: evidence from forecast pooling

Nikolaos Kourentzes, Devon K. Barrow and Fotios Petropoulos, 2018, International Journal of Production Economics.

Forecast selection and combination are regarded as two competing alternatives. In the literature there is substantial evidence that forecast combination is beneficial, in terms of reducing the forecast errors, as well as mitigating modelling uncertainty as we are not forced to choose a single model. However, whether all forecasts to be combined are appropriate, or not, is typically overlooked and various weighting schemes have been proposed to lessen the impact of inappropriate forecasts. We argue that selecting a reasonable pool of forecasts is fundamental in the modelling process and in this context both forecast selection and combination can be seen as two extreme pools of forecasts. We evaluate forecast pooling approaches and find them beneficial in terms of forecast accuracy. We propose a heuristic to automatically identify forecast pools, irrespective of their source or the performance criteria, and demonstrate that in various conditions it performs at least as good as alternative pools that require additional modelling decisions and better than selection or combination.

Download paper.