Author Archives: Nikos

ABC-XYZ analysis for forecasting

The ABC-XYZ analysis is a very popular tool in supply chain management. It is based on the Pareto principle, i.e. the expectation that a minority of cases has a disproportionate impact on the whole. This is often referred to as the 80/20 rule, with the classic example that 80% of the wealth is owned by 20% of the population (current global statistics suggest that 1% of the global population holds more than 50% of the wealth, but that is beyond the scope of this post!).

ABC analysis

Let us first consider the ABC part of the analysis, which ranks items in terms of importance. The previous sentence is intentionally vague about what importance is and which items should be considered. I will first explain the mechanics of ABC analysis and then return to these questions. Suppose for now that we measure importance by average (or total) sales over a given period and that we have 100 SKUs (Stock Keeping Units).

To make the example easier to follow I will explain the ideas behind it, but also provide R code to try it out. First let us get some data. I will use the M3-competition dataset that is available in the package Mcomp:

# Let's create a dataset to work with
# Load Mcomp dataset - or install if not present
if (!require("Mcomp")){install.packages("Mcomp"); library(Mcomp)}
# Create a subset of 100 monthly series with 5 years of data
# Each column of array sku is an item and each row a monthly historical sale
sku <- array(NA,c(60,100),dimnames=list(NULL,paste0("sku.",1:100)))
Y <- subset(M3,"MONTHLY")
for (ts in 1:100){
  sku[,ts] <- c(Y[[ts]]$x,Y[[ts]]$xx)[1:60]
}

Now we calculate the mean volume of sales for each SKU and rank them from maximum to minimum:

# Calculate mean sales per SKU
sku.m <- colMeans(sku)
# Order them from largest to smallest
sku.o <- order(sku.m,decreasing=TRUE)
sku.s <- sku.m[sku.o]

Typically in ABC analysis we consider three classes, each containing a percentage of the items. Common values are: A – top 20% of items; B – middle 30%; and C – bottom 50%. Given the ranking we have obtained based on mean sales, we can now easily identify which item belongs to which class.
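
The assignment itself is straightforward once the ranking is in hand. A minimal sketch (the mean sales are simulated here so that the snippet stands alone; in the post they come from colMeans(sku)):

```r
# Assign ABC classes from the ranking of mean sales
# (sku.m is simulated for illustration)
set.seed(1)
sku.m <- runif(100, min = 10, max = 1000)
sku.o <- order(sku.m, decreasing = TRUE)
abc.class <- character(100)
# Top 20 ranked items become A, next 30 B, bottom 50 C
abc.class[sku.o] <- rep(c("A", "B", "C"), c(20, 30, 50))
table(abc.class)
```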

To find the concentration of importance in each class, we can consider the cumulative sales:

# Calculate cumulative mean sales on ordered items
sku.c <- cumsum(sku.s)
# Find concentration per class
abc.c <- sku.c[c(20,50,100)]/sku.c[100]
abc.c[2:3] <- abc.c[2:3] - abc.c[1:2]
abc.c <- array(abc.c,c(3,1),dimnames=list(c("A","B","C"),"Concentration"))
abc.c <- round(100*abc.c,2)
print(abc.c)

This gives the following result:

Class Concentration
A 29.77%
B 34.12%
C 36.10%

You can use the function abc in TStools to do all these calculations quickly and get a neat visualisation of the result (Fig. 1).


Fig 1. ABC analysis on the first 100 monthly series of the M3-competition dataset.

It is easy to see that in this example the concentration for the A class is in fact quite low: the top 20% of items correspond to almost 30% of importance in terms of sales volume. In my experience this is atypical; usually the A class dominates, resulting in curves that saturate much faster.

Let me return to the question of what importance is. In the example above I used mean sales over a period. Although this is very easy to calculate and requires no additional inputs, it is hardly appropriate in most cases. For example, consider an item with a minimal profit margin but a very high volume of sales, and an item with a massive profit margin but a mediocre volume of sales. Which one is more important? There is no absolutely correct answer; it depends on the business context and objectives. Considering sales value, profit margins or some pre-existing indicator of importance that may already be in place is often more appropriate. In short, depending on the criteria we set, we can get any result we want from the ABC analysis, so it is important to choose carefully.

How many classes should we use? Are three enough? Should we use more? What percentages? To answer these questions one has to know what the ABC analysis is done for. I have seen companies using 4 classes (ABCD) or even more; however, these are often not tied to a clear decision, and therefore I would argue they were of little benefit. Unless your classification is actionable there is limited value you can get out of it. Three classes have the advantage that they separate the assortment into three categories of high, medium and low importance, which is easy to communicate. What about the percentages? Again, there is no right or wrong. The 20% cut-off point for the A class originates from the Pareto principle, and the rest follow. If the decision context is known, one might make a more informed choice of cut-off points, though I would argue that it is the pairs of cut-off and concentration that matter.

The third point one has to be aware of is that ABC analysis is very sensitive to the number of items that go into the analysis. In the previous example, if we added another 100 SKUs the classification into A, B and C classes would change substantially, as the results are always proportional to the number of items included. What does this mean in practice? The results of an ABC analysis done for the SKUs in a market segment will not stay the same if we consider the same SKUs in a super-segment that contains more SKUs. A-class products in a specific market may be C-class in the overall market. So the scope of the analysis really defines the results. Again: what is the decision that the ABC analysis will support?
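
To illustrate the point, here is a small sketch with simulated mean sales: the same 100 items are classified once on their own and once inside a pool of 200.

```r
# Sensitivity of ABC to scope (simulated mean sales)
set.seed(1)
m200 <- runif(200, min = 10, max = 1000)
classify <- function(m){
  cls <- character(length(m))
  cls[order(m, decreasing = TRUE)] <- rep(c("A", "B", "C"),
                                          round(length(m) * c(0.2, 0.3, 0.5)))
  cls
}
cls.segment <- classify(m200[1:100])   # ABC within the market segment
cls.overall <- classify(m200)          # ABC within the super-segment
# Items near the cut-offs change class when the scope widens
table(segment = cls.segment, overall = cls.overall[1:100])
```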

You may have already spotted that I am somewhat critical of the analysis. Let me summarise the issues. The analysis is very sensitive to the metric of importance, the number of classes and cut-off points, as well as the number of items considered. There is no best solution, as it always depends on the decision context. A final relevant criticism is that ABC analysis provides a snapshot in time and does not show any dynamics. Is an item gaining or losing in importance?

XYZ analysis

The XYZ analysis focuses on how difficult an item is to forecast, with X being the class of easier items and Z the class of the more difficult ones. To perform the XYZ analysis one follows the same logic as for ABC. Therefore the important question is how to define a metric of forecastability. Let me mention here that the academic literature has attempted to put a formula to this quantity; I would argue unsuccessfully. What I will discuss here are far from perfect solutions, but they at least have some practical advantages.

Textbooks have supported the use of the coefficient of variation. This is so flawed that every time I read it… well, let me explain the issues. The coefficient of variation is a scaled version of the standard deviation of the historical sales. This tells us nothing about how easy the sales are to forecast. Let me illustrate this with a simple example. Consider an item that has more or less level sales with a lot of variability, and an item that has seasonal sales with no randomness whatsoever. The first is difficult to forecast, while the second is as easy as it gets (just copy the previous season as your forecast!). As Fig. 2 illustrates, the coefficient of variation would not indicate this, giving the seasonal series the higher value.


Fig. 2: Example series for which the coefficient of variation fails to indicate which is more difficult to forecast.
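
The same point can be checked numerically. In the sketch below (simulated series, not the ones plotted in Fig. 2) the noisy level series is harder to forecast, yet the noiseless seasonal series gets the higher coefficient of variation:

```r
# Coefficient of variation failing as a forecastability measure
set.seed(1)
level.series    <- 100 + rnorm(60, sd = 15)               # noisy, hard to forecast
seasonal.series <- 100 + 50 * sin(2 * pi * (1:60) / 12)   # deterministic, trivial
cv <- function(y) sd(y) / mean(y)
cv(level.series)     # noticeably lower...
cv(seasonal.series)  # ...than the perfectly predictable seasonal series
```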

A better measure is forecast errors, which directly relate to the non-forecastable parts of the series. This introduces a series of different questions: which forecasting method to use? Which error metric? Should we use in-sample or out-of-sample errors? Again, there is no perfect answer. Ideally we would like to use out-of-sample errors, but that would require us to have a history of forecast errors from an appropriate forecasting method, or to conduct a simulation experiment with a holdout.

As for the method, this is perhaps the most complicated question. A single method would not be adequate, for the same reason as the coefficient of variation: the latter implies that the forecasting method is the arithmetic mean (the value from which the standard deviation is calculated). An appropriate set of methods should be able to cope with level, trended and seasonal time series alike. A simplistic solution is to use the naive (random walk) and seasonal naive methods with a simple selection routine; the difference between seasonal and non-seasonal time series is typically substantial enough for even weak selection rules to work fine. An even better solution is to use a family of models, such as exponential smoothing, and do proper model selection, for instance using the AIC or similar information criteria.
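
As a sketch of such a simple selection routine (simulated monthly series; the in-sample mean absolute error decides between the two methods):

```r
# Naive vs seasonal naive with a simple in-sample selection rule
set.seed(1)
y <- 100 + 50 * sin(2 * pi * (1:60) / 12) + rnorm(60, sd = 5)
mae.naive  <- mean(abs(diff(y)))              # naive: forecast is y[t-1]
mae.snaive <- mean(abs(y[13:60] - y[1:48]))   # seasonal naive: forecast is y[t-12]
chosen <- ifelse(mae.snaive < mae.naive, "seasonal naive", "naive")
chosen   # the strong seasonality makes seasonal naive the clear winner here
```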

The error metric should be robust (do not use percentage errors for this!) and scale independent. I will not go into the details of this discussion, but instead refer to a recent presentation I gave on the topic. The first few slides should give you an idea of my views.

The function xyz in the TStools package for R allows you to do this part of the analysis automatically, but as illustrated for ABC, it is easy to do manually. Fig. 3 provides the result for the same dataset. Similarly, we can see what percentage of our assortment is responsible for what percentage of our forecast errors, and so on.


Fig. 3: XYZ analysis on the first 100 monthly series of the M3-competition dataset.
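
To mirror the ABC mechanics manually, one can rank the items on a scaled in-sample error instead of mean sales. A sketch with simulated data (the error measure here is a simple scaled naive MAE, chosen for illustration; it is an assumption, not necessarily the metric used by the xyz function):

```r
# Manual XYZ split on a scaled in-sample naive error (simulated SKUs)
set.seed(1)
sku <- sapply(1:100, function(i) 50 + 100 * runif(1) + rnorm(60, sd = 5 * runif(1)))
err <- apply(sku, 2, function(y) mean(abs(diff(y))) / mean(abs(y)))
xyz.o <- order(err, decreasing = FALSE)   # easiest to forecast first
xyz.class <- character(100)
xyz.class[xyz.o] <- rep(c("X", "Y", "Z"), c(20, 30, 50))
table(xyz.class)
```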

One has to note that the same critiques raised for the ABC analysis apply to the XYZ analysis as well.

Putting everything together – the ABC-XYZ analysis

Once we have characterised our assortment for both ABC and XYZ classes, we can put these two dimensions of analysis together, as Fig. 4 illustrates.


Fig. 4: ABC-XYZ classification. The first 100 monthly series of the M3-competition are characterised in terms of importance and forecastability.

Let us consider what these classes indicate. I will discuss the four corners of the matrix:

  • AX: Very important items, but relatively easy to forecast;
  • CX: Relatively unimportant items that are relatively easy to forecast;
  • AZ: Very important items that are hard to forecast;
  • CZ: Relatively unimportant items that are hard to forecast.
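
Crossing the two classifications is then a simple tabulation. A sketch (the class labels are simulated here; in practice they come from the ABC and XYZ rankings):

```r
# Build the ABC-XYZ matrix from the two class memberships (simulated labels)
set.seed(1)
abc.class <- sample(rep(c("A", "B", "C"), c(20, 30, 50)))
xyz.class <- sample(rep(c("X", "Y", "Z"), c(20, 30, 50)))
abcxyz <- table(abc.class, xyz.class)
abcxyz              # counts per cell of the matrix
abcxyz["A", "Z"]    # important but hard-to-forecast items
```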

In-between classes are likewise easy to interpret. This classification can be quite handy for allocating resources to the forecasting process. Suppose for instance that we have a team of experts adjusting forecasts. It is more meaningful to dedicate their time to the lower-left corner of the matrix, rather than the top-right corner, when gathering additional information to enrich statistical forecasting. I am not referring to a single cell of the matrix, but to the wider neighbourhood. Alternatively, consider the case that a new forecasting system is implemented. Ideally we would like everything to run smoothly from day 1. Ideally… in practice things go wrong. Again, we would want to be more careful with the lower-left corner of the matrix.

Following the same logic, one would expect that it is easier to improve accuracy on the top part of the matrix than on the lower part. If we find that we are doing relatively badly in terms of accuracy on AX items, we know that we are messing up on important items that should be relatively easy to forecast.

I often make the argument for automating the forecasting process using the ABC-XYZ analysis. A large chunk of the assortment (the top-right side) can be automated relatively safely, as these are items that are not that crucial and are easier to forecast. If you need to produce forecasts for several thousand items (or even more!), there is no chance you can dedicate equal attention to all of them. Similarly, simple alerts should be able to deal with AX products. AZ products, however, are difficult to forecast yet important, so we should get them right: these deserve more resources and are potentially difficult to fully automate (there is adequate evidence in the literature that experts add value overall).

Concluding remarks

Everything is relative with the ABC-XYZ analysis! I have avoided mentioning even once an error value as a cut-off point to define easy- and difficult-to-forecast items. Such logic is flawed: we can reasonably only talk about relative performance, and we should not expect the same error or importance values to be applicable to different assortments.

I have argued several times that intermittent demand forecasting is a mess. True to their nature, intermittent items also mess up the ABC-XYZ analysis. The reason is that they are typically low volume and would take over the C class of ABC, pushing other relatively unimportant items into the A and B classes (depending on how many intermittent items one permits in the analysis). Furthermore, measuring accuracy for intermittent demand with standard error metrics is wrong, and would typically result in forecast errors that are incomparable with those of fast-moving items, distorting the results of the XYZ analysis. A good idea is to separate intermittent items from fast-moving items before conducting an ABC-XYZ analysis. Similarly, new products will distort the analysis as well.

ABC-XYZ analysis can be a powerful diagnostic tool, as well as very helpful for allocating resources in the forecasting process. However, if it is not tied to actionable decisions, it is difficult to set it up correctly in terms of what a good metric of importance or forecastability is, how many classes to use and so on. It certainly is not a magic bullet and suffers from several weaknesses, but which tool does not?

ISIR 2016 research presentations

Last August I attended the International Symposium on Inventories, organised by the International Society for Inventory Research (ISIR 2016). This is the second time I have attended this conference and it was again a very good experience, with interesting talks on a variety of topics relevant to forecasting research. I organised a special track on estimating demand uncertainty, looking at the link between forecasts, inventories and associated decisions.

I was involved in a number of presentations with my colleagues and students, which you can find here:

  1. Barrow D., Kourentzes N., Petropoulos F., Combining and pooling forecasts based on selection criteria.
  2. Kourentzes N., Tabar B. R., Barrow D. K., Demand forecasting by temporal aggregation: using optimal or multiple aggregation levels?
  3. Sagaert Y. R., De Vuyst S., Kourentzes N., Aghezzaf E-H., Desmet B., Incorporating macro-economic leading indicators in inventory management.
  4. Saoud P., Kourentzes N., Boylan J., Estimating demand uncertainty over multi-period lead times.
  5. Svetunkov I., Kourentzes N., Asymmetric prediction intervals using half moment of distribution.


ISF 2016 research presentations

It has been a while since I have had the time to post any updates, but I finally got round to it. Last June I attended the International Symposium on Forecasting (ISF) in Santander, which was very interesting and enjoyable. I have already posted about the presentations I gave. I was also involved in a number of presentations spanning different topics that were given by colleagues and PhD students. You can find a complete list of these here:

  1. Barrow D. K., Mitrovic A., Holland J., Kourentzes N., Ali M., Developing intelligent tutoring support for teaching business forecasting: the forecasting intelligent tutoring system.
  2. Allen G. P., Fildes R., Kourentzes N., Bias in decadal climate model forecasts.
  3. Kourentzes N., Forecasting with temporal hierarchies.
  4. Kourentzes N., Trapero J. R., Svetunkov I., Measuring forecasting performance: a complex task.
  5. Sagaert Y. R., Aghezzaf E-H., Kourentzes N., Desmet B., Variable selection for long-term forecasting using temporal aggregation.
  6. Schaer O., Kourentzes N., Fildes R., Forecasting demand with internet searches (and social media shares).
  7. Svetunkov I., Kourentzes N., Model parameter estimation with trace forecast likelihood.
  8. Waller D., Boylan J., Kourentzes N., Modelling multiple seasonalities across hierarchical aggregation levels.

With Fotios Petropoulos we also delivered a workshop on forecasting with R.


SAS-IIF Grant to Promote Research on Forecasting 2016

Every year the International Institute of Forecasters (IIF), in collaboration with SAS, supports forecasting research with grants of up to $5,000. The application deadline for this year is the 30th of September 2016.

Applications must include:

  • Description of the project, up to 4 pages;
  • Letter of support from the home institution that the researcher is based at;
  • Brief CV, up to 4 pages;
  • Budget and workplan for the project.

For more information have a look at the IIF website.

Another look at estimators for intermittent demand

F. Petropoulos, N. Kourentzes and K. Nikolopoulos, 2016, International Journal of Production Economics, 181: 154-161. http://dx.doi.org/10.1016/j.ijpe.2016.04.017

In this paper we focus on a new methodology for looking at estimators for intermittent demand data. We propose a new aggregation framework for intermittent demand forecasting that performs aggregation over the demand volumes, in contrast to the standard framework that employs temporal (over time) aggregation. To achieve this we construct a transformed time series, the inverse intermittent demand series. The new algorithm is expected to work best on erratic and lumpy demand, as a result of the variance reduction of the non-zero demands. The improvement in forecasting performance is empirically demonstrated through an extensive evaluation in more than 8,000 time series of two well-researched spare parts data sets from the automotive and defence sectors. Furthermore a simulation is performed so as to provide a stock-control evaluation. The proposed framework could find popularity among practitioners given its suitability when dealing with clump sizes. As such it could be used in conjunction with existing popular forecasting methods for intermittent demand as an exception handling mechanism when certain types of demand are observed.

Download paper.

Academia vs. Business: Two Sides of the Same Coin

Issue 41 of Foresight featured a short commentary by Sujit Singh on the gaps between academia and business. Motivated by our focus on producing and disseminating research that is directly applicable to practice, together with Fotios Petropoulos we present in this commentary our views on some of the very useful and interesting points raised by Sujit, and conclude with our vision for enhanced communication between the two worlds.

On translating accuracy to money

It is true that the majority of traditional error measures (along with the MAPE, which is very widely used in practice) focus on the performance of point forecasts and their respective accuracy. These are convenient as context-free summary statistics, but hardly relate to real decision costs. Therefore, a critical question is how these translate into business value and how improving forecasting affects utility metrics, such as inventory and backlog costs, customer service level (CSL) and mitigation of the bullwhip effect. Fortunately, there is a good bit of research that focuses on such links. Here are two very recent examples.

Barrow and Kourentzes (2016) explored the impact of forecast combinations (combining forecasts from different methods) on safety stocks, and found that combinations can lead to reductions compared to using a single `best' forecast. Wang and Petropoulos (2016) evaluated the impact on inventory of baseline statistical and judgmentally revised forecasts. These works show that there is a strong connection between the variance of forecast errors and inventory performance.

However, one important point has to be emphasised here: there is limited transparency in how forecasts produced by demand planners are translated into ordering decisions by inventory managers. Research typically looks at idealised cases, ignoring the targets and politics that drive inventory decisions. In such cases, the estimated economic benefit of improved forecasts may not reflect organisational realities: forecasting research should pay more attention to the organisational aspects of forecasting.

On what is good accuracy

Forecast accuracy levels vary across different industries and horizons. For example, a 20% forecast error may be sensible in certain retailing setups, but disastrous in aggregate electricity load forecasting. Short-term forecasting is typically easier, while long-term forecasting is more challenging. The nature of the available data is also relevant: fast- versus slow-moving items; presence of trend and/or seasonality; promotional frequency; and so on.

Our approach would always be to benchmark against (i) simple methods, such as the naïve or seasonal naïve, and (ii) industry-specific (“best practice”) benchmarks. Reporting improvements in accuracy relative to these benchmarks helps identify specific problems with the forecasting function and can lead to further refinements. Using relative metrics also overcomes the misplaced focus on what a good target for percentage accuracy is, since such targets do not appreciate the intricacies of the data that the forecast has to deal with.
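
As a sketch of what such relative reporting looks like (simulated actuals and forecasts; the ratio below is a simple relative MAE against the naive benchmark, with values below 1 indicating an improvement):

```r
# Accuracy relative to a naive benchmark (simulated data)
set.seed(1)
actuals  <- 100 + rnorm(24, sd = 10)
f.method <- actuals + rnorm(24, sd = 5)   # some candidate method's forecasts
f.naive  <- c(100, actuals[-24])          # naive: previous period's actual
rel.mae <- mean(abs(actuals - f.method)) / mean(abs(actuals - f.naive))
rel.mae   # below 1: the method improves on the naive benchmark here
```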

On available software packages

Different software packages offer different core features, with some of them specialising in specific families of methods and/or industries. Previously, software vendors were invited to participate in large-scale forecasting exercises (see the M3-competition), with the relative rankings of the participating software being available through the original (Makridakis and Hibon, 2000) and subsequent research reports.

In any case, the expected benefits from adopting a software package are a function of data availability, the forecast objective (what needs to be forecast and how far into the future) and the need for automation. Nonetheless, there is a need for an up-to-date review and benchmarking of available commercial and non-commercial software packages. Differences exist even in the various implementations of the simplest methods (such as Simple Exponential Smoothing), with often unknown effects on accuracy. Software packages are important in structuring the forecasting process, but vendors often impose their own visions of what is important, and these are not always backed up by research. How should one explore the time series at hand? Can we support model selection and specification? How do we best incorporate judgemental adjustments?

Our view is that software vendors should provide the tools for users of varying expertise to solve their problems (see comments on customisability by Petropoulos, 2015), but also be explicit about the risks of a solution. Training users is regarded as an important dimension of improving forecast quality (Fildes and Petropoulos, 2015), as demand planners cannot be replaced by an algorithm. We should not aim for a single solution that will magically do everything; there are always `horses for courses'.

On hierarchical forecasts

Organisations often look at their inventory of data in hierarchies. These can be across products, across markets or across any other classification that is meaningful from a decision making or reporting point of view. Data at different hierarchical levels reveal different attributes of the product history. Although forecasts produced at different hierarchical levels can be translated to forecasts of other levels via aggregation or disaggregation (top-down and bottom-up), the level at which the forecasts are produced will influence the quality of the final forecasts at all the various levels.

Can we know a priori what the best level to produce forecasts at is? Unfortunately, this is not possible: data have different properties, resulting in different `ideal levels', but, more importantly, companies have different objectives, and each objective may require a different setup.

We believe that the greatest benefit from implementing hierarchical approaches to forecasting is the resulting reconciliation of forecasts at different decision-making levels. The importance of aligning decision-making across levels cannot be overstated. More novel techniques allow hierarchies to be forecast and reconciled across different forecast horizons (Petropoulos and Kourentzes, 2014). Recent research (Hyndman and Athanasopoulos, 2014) has demonstrated that approaches that focus on a single level of the hierarchy, such as top-down or bottom-up, should be replaced with approaches that appropriately combine forecasts (and subsequently information) from all aggregation levels.

It is important to remember that forecasts calculated from data at any level of the hierarchy can be evaluated at all other required levels: one first has to produce the aggregated/disaggregated forecasts and then compare them with the actual data at the respective level.

Forecasts are used by companies

Research often considers forecasting as an abstract function that is not part of a company or its ecosystem. At the same time, there is ample evidence of the benefits of collaborative forecasting and information-sharing both within the different departments of a company and across the supply chain.

A recent example is provided by Trapero and colleagues (2012), who analyse retail data and show that information sharing between retailer and supplier can significantly improve forecasting accuracy (by up to 8 percentage points in terms of MAPE). Such research is useful for modelling how forecasts are generated and used in organisations.

A call for more data and case studies

Sujit urges the production of evidence of “minimum/average/maximum” benefits in different contexts. However, current forecasting research has analysed very few data sets, and very few company cases are publicly available. The M1 and M3 competition data sets have been utilised time and again in subsequent studies, so the results and solutions derived from them are susceptible to “over-fitting” and hence may not generalise. Most papers on intermittent demand forecasting make use only of automotive-sales data, as well as data sets from the Royal Air Force in the UK. It would be valuable to test our theories and methods on more diverse data sets, but researchers find these hard to acquire.

We call on practitioners and on vendors to share (after anonymising) empirical data with researchers. The availability of a large number of time series and/or cross-sectional data across a number of industries will increase our understanding of the advantages, disadvantages, and limitations of existing and new forecasting methods, models, frameworks, and approaches.

Researchers are hungry for data, while practitioners hunger for solutions to their problems: reducing the barriers will benefit both sides. Still, researchers must appreciate the constraints that limit a company's willingness to make its data public, and practitioners need to be more proactive in facilitating forecasting research.

References

Barrow D. and Kourentzes N. (in press) “Distributions of forecasting errors of forecast combinations: implications for inventory management“, International Journal of Production Economics.

Fildes R. and Petropoulos F. (2015) “Improving forecast quality in practice”, Foresight: The International Journal of Applied Forecasting 36, pp. 5–12.

Hyndman R. and Athanasopoulos G. (2014) “Optimally reconciling forecasts in a hierarchy”, Foresight 35 (Fall 2014), pp. 42–48.

Makridakis S. and Hibon M. (2000) “The M3-competition: results, conclusions and implications”, International Journal of Forecasting 16, pp. 451-476.

Petropoulos F. & Kourentzes N. (2014) “Improving forecasting via multiple temporal aggregation”, Foresight: The International Journal of Applied Forecasting, Issue 34 (Summer 2014), pp. 12-17

Petropoulos F. (2015) “Forecasting Support Systems: ways forward”, Foresight: The International Journal of Applied Forecasting, Issue 39 (Fall 2015), pp. 5-11.

Wang X. and Petropoulos F. (in press) “To select or to combine? The inventory performance of model and expert forecasts”, International Journal of Production Research.

Trapero J.R., Kourentzes N. and Fildes R. (2012) “Impact of Information Exchange on Supplier Forecasting Performance“, Omega 40, pp. 738-747.

This text is an adapted version of:

F. Petropoulos and N. Kourentzes, 2016, Commentary on “Forecasting: Academia vs. Business”: Two Sides of the Same Coin, Foresight: The International Journal of Applied Forecasting.

The Impact of Special Days in Call Arrivals Forecasting: A Neural Network Approach to Modelling Special Days

D. Barrow and N. Kourentzes, 2016, European Journal of Operational Research. http://dx.doi.org/10.1016/j.ejor.2016.07.015

A key challenge for call centres remains the forecasting of high frequency call arrivals collected in hourly or shorter time buckets. In addition to the complex intraday, intraweek and intrayear seasonal cycles, call arrival data typically contain a large number of anomalous days, driven by the occurrence of holidays, special events, promotional activities and system failures. This study evaluates the use of a variety of univariate time series forecasting methods for forecasting intraday call arrivals in the presence of such outliers. Apart from established statistical methods we consider artificial neural networks (ANNs). Based on the modelling flexibility of the latter we introduce and evaluate different methods to encode the outlying periods. Using intraday arrival series from a call centre operated by one of Europe’s leading entertainment companies, we provide new insights on the impact of outliers on the performance of established forecasting methods. Results show that ANNs forecast call centre data accurately, and are capable of modelling complex outliers using relatively simple outlier modelling approaches. We argue that the relative complexity of ANNs over standard statistical models is offset by the ease with which multiple and unknown effects during outlying periods can be coded.

Download paper.

International Symposium on Forecasting Presentations

Last week I attended the International Symposium on Forecasting 2016. It was very interesting and enjoyable. Apart from the workshop on forecasting with R, I gave the following presentations:

Forecasting with Temporal Hierarchies
This presentation was given to the practitioner track of the conference and its aim was to introduce the basic idea of temporal hierarchies for forecasting and to highlight the forecasting and business challenges they can help address. You can read more details in the abstract.


Download presentation.

Measuring Forecasting Performance: A Complex Task
This research presentation introduces a new error metric to evaluate forecast performance. Although there has been substantial research on accuracy metrics, there has been very limited work on bias metrics, which are an equally important dimension of performance. An interesting aspect of the proposed metric is the informative visualisation of accuracy and bias. You can read more details in the abstract.


Download presentation.

Material for “Forecasting with R: A practical workshop”

Together with Fotios Petropoulos, I gave a workshop on producing forecasts with R at the International Symposium on Forecasting 2016. You can find the material of the workshop here. The workshop notes assume knowledge of what the various forecasting methods do, which is only briefly explained in the workshop's slides, and mostly focus on showing which functions to use and how, so as to perform a wide variety of forecasting tasks:

  • Time series exploration
  • Univariate (extrapolative) forecasting
  • Intermittent demand series forecasting
  • Forecasting with regression
  • Special topics: (i) Hierarchical forecasting; (ii) ABC-XYZ analysis; and (iii) LASSO regression

Material:

  1. Workshop notes: these provide code examples with comments. You will also find some references for the various methods used in the workshop.
  2. Workshop slides: these provide an extremely brief overview of some of the methods used and their implementation.
  3. Workshop R solution scripts: these replicate the examples in the notes.
  4. Workshop data: these are needed to replicate the examples in the notes and scripts.

The notes are aimed at researchers and experienced practitioners, who are comfortable with the theory behind the various models and methods. Nonetheless, they demonstrate how to quickly explore data, fit models, produce forecasts and evaluate them for a wide range of cases. I hope you find this material useful.

How to choose a forecast for your time series

Choosing the most appropriate forecasting method for your time series is not a trivial task and, even though scientific forecasting has been practised for many decades, how best to do it remains an open research question. Nonetheless, there are some reasonable ways to deal with the problem which, although not perfect, can provide very reasonable results and help you automate the forecasting process.

I will not attempt a textbook-style description of how to do it, but I will outline some of the basic alternative approaches, paying particular attention to common mistakes that I have seen in practice.

First, we need to have clear objectives:

  1. What do we need to forecast?
  2. What is the criterion or criteria of success?
  3. What is the relevant forecast horizon?

These may sound like common sense, but perhaps the most common misconception I come across in companies is the expectation that a single forecasting method will produce good forecasts across all short, medium and long forecast horizons. This is an ‘optimistic’ way of looking at forecasting, which will often result in poor forecasts for some of the horizons.

1. A general approach: using a validation sample

Statistical forecasts require past historical data, which we can use in many ways. Suppose we have a monthly time series with 5 years of data and our objective is to forecast demand six months into the future. As can be seen in Fig. 1, the time series is clearly seasonal and trending. Accordingly, two alternative forecasts have been built, one using exponential smoothing and one using ARIMA (there is no need to go into the model details).

Fig. 1: An exponential smoothing (ETS) and ARIMA forecast.

Both forecasts seem reasonable and are too close to each other to judge which one to pick by visual inspection alone. A statistically sound way to choose is to split the series into two parts: one to fit the model and one to validate how well it forecasts, picking the model that is most accurate. Fig. 2 demonstrates this. The models are fitted on the first part of the series and the validation set is not used at all for this. Out-of-sample forecasts are then produced and compared with the values in the validation set, as if these were true future forecasts. This is why it is important not to use any of the validation data for model building.


Fig. 2: ETS and ARIMA forecasts on validation set. The models are fitted only in the first part of the time series and the validation set is used only to assess their performance.

We can then measure the accuracy of the forecasts on the validation set and pick the best performing model. In this example I do this by measuring the Mean Absolute Error (MAE), which indicates that ETS is the best.

Model   MAE
ETS     132.15
ARIMA   160.85
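
The selection procedure above can be sketched in a few lines of base R. This is a minimal illustration, not the code behind the figures: AirPassengers stands in for the monthly series of the post, `HoltWinters` and `arima` from the stats package stand in for the ETS and ARIMA models, and the specific ARIMA order is an assumption chosen for the example.

```r
# Validation-set model selection sketch (base R only).
# AirPassengers is a stand-in for the monthly demand series in the post.
y <- AirPassengers                       # monthly series, 12 years
n <- length(y)
h <- 6                                   # target forecast horizon
fit.part <- window(y, end = time(y)[n - h])      # fitting set
valid    <- window(y, start = time(y)[n - h + 1]) # validation set (last h months)

# Two candidate models, fitted only on the first part of the series
ets.fit   <- HoltWinters(fit.part)                       # exponential smoothing
arima.fit <- arima(fit.part, order = c(1, 1, 1),
                   seasonal = list(order = c(0, 1, 1)))  # an assumed seasonal ARIMA

ets.fc   <- predict(ets.fit, n.ahead = h)
arima.fc <- predict(arima.fit, n.ahead = h)$pred

# Mean Absolute Error on the validation set; pick the smaller
mae <- c(ETS = mean(abs(valid - ets.fc)), ARIMA = mean(abs(valid - arima.fc)))
names(which.min(mae))
```

The same loop extends naturally to any number of candidate forecasts, including judgmental ones, as long as they are produced without touching the validation set.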

The example above is quite easy to repeat with as many forecasting methods as needed, including judgmentally generated or adjusted forecasts. However, there is an issue with it: we are basing our choice on a single measurement. This is risky, as we may be very (un)lucky in this measurement and it may not be representative of the typical behaviour of the forecasting methods. To mitigate this problem we should perform a rolling origin evaluation on the validation set. To do this we need a validation set that is longer than the target forecast horizon, and then follow these steps:

  1. Produce a forecast for 6 steps ahead
  2. Increase the fitting sample by one period (so the first point of the validation set is moved to the fitting set and the validation set is smaller by one period)
  3. Refit the model in the new fitting set
  4. Go back to step 1 and repeat until the validation set is exhausted, i.e. no more 6-step ahead forecasts can be evaluated.

In this example we could use 18 months as the validation set. This way we can produce 18 − 6 + 1 = 13 six-step ahead forecasts, spanning 1.5 years of data, and obtain a more reliable measurement of the MAE, as shown in Fig. 3.
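
The rolling origin loop is simple to implement. The sketch below uses base R and, purely for brevity, a seasonal naive forecast (this month’s forecast equals the same month last year) in place of refitting ETS and ARIMA at each origin; in practice you would refit your candidate models inside the loop.

```r
# Rolling-origin evaluation sketch (base R, seasonal naive forecast for brevity).
# AirPassengers is again a stand-in for the series in the post.
x <- as.numeric(AirPassengers)
n <- length(x)
h <- 6                                  # forecast horizon
v <- 18                                 # validation set length (last 18 months)
origins <- v - h + 1                    # 18 - 6 + 1 = 13 forecast origins

mae <- numeric(origins)
for (o in 1:origins){
  fit.end <- n - v + o - 1              # last observation in the fitting set
  actual  <- x[(fit.end + 1):(fit.end + h)]
  # Seasonal naive forecast: the value observed 12 months earlier
  fc      <- x[(fit.end - 12 + 1):(fit.end - 12 + h)]
  mae[o]  <- mean(abs(actual - fc))
}
round(c(first.origin = mae[1], average = mean(mae)), 2)
```

Averaging the MAE across the 13 origins gives the more reliable measurement discussed above; refitting a real model at each origin changes only the line that produces `fc`.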


Fig. 3: Rolling origin forecasts. Forecasts from each origin are plotted in different colours.

Observe that the second ARIMA forecast is rather poor (a straight line that does not follow the seasonal shape). This is also reflected in the errors for each forecast origin reported in the following table. We can see that although ETS is not always best (ARIMA is best for origins 6, 7 and 8), on average it is. Had we looked at only a single forecast we could have ended up selecting ARIMA instead. Looking at multiple forecasts gives us confidence in choosing the overall best; ETS is therefore more reliable and should be used to produce any future forecasts.

Origin    ETS      ARIMA
1          76.81   106.00
2          81.77   467.16
3          86.71   123.83
4          85.93   114.76
5          93.09   103.62
6          95.11    82.90
7          83.44    71.80
8          99.23    82.21
9         107.55   116.51
10        100.13   123.03
11        108.71   137.62
12        109.88   147.90
13        132.15   160.85
Average    96.96   141.40

Using a validation sample does not impose any particular restrictions on what forecasts can be compared, but it is obvious that we need to ‘sacrifice’ sample for the validation set: that sample could otherwise be used to build better forecasts as part of the fitting set.

2. A more specialised approach: Information Criteria

An alternative is to use what are called Information Criteria (IC). There are many criteria; the most common are Akaike’s Information Criterion (AIC) and the Bayesian Information Criterion (BIC). Before explaining what ICs do, let us keep in mind that a more complex model is typically able to fit past data better, to the extent that it may overfit. If it overfits, it will not produce good forecasts for unseen future data. So we do not want to select the model with the lowest forecast error in the fitting sample, as it may simply have overfitted. Note that using a validation set avoids this problem by measuring forecasting accuracy on a different sample altogether: the validation set. For an example of what overfitting is you can check this.

IC try to balance how good a model fits and its complexity. In general IC are of the following form:

IC = goodness of fit + penalty for model complexity.

Model complexity is typically the number of model parameters, scaled by some factor to make it comparable to the goodness-of-fit term, which itself is based on the likelihood function. I will avoid explaining what the likelihood is and go straight to the definition of the AIC (you can skip the equations without missing the logic):

 AIC = 2k - 2\ln(L) ,

where k is the number of model parameters, ln is the natural logarithm and L is the likelihood function. Under certain conditions (the residuals following independent, identically distributed normal distributions with zero mean), the maximum likelihood estimate of the variance (σ²) of the model residuals is equal to the Mean Squared Error (MSE):

 MSE = \frac{1}{n}\sum_{i=1}^{n}{\left(y_i - f_i\right)^2} ,

where y_i are the actuals, f_i the forecasts and n the sample size. Using this we can write:

 \ln(L) = -\frac{n}{2}\ln(2\pi) - \frac{n}{2}\ln(\sigma^2) - \frac{n}{2\sigma^2}MSE = -\frac{n}{2}\ln(MSE) + C ,

where C is a constant that depends only on the sample used and not on the forecasting model. This is a critical point that I will return to later. This in turn gives:

 AIC = 2k - 2\ln(L) = 2k - 2\left(-\frac{n}{2}\ln(MSE) + C\right) = 2k + n\ln(MSE) - 2C .

Since the aim is to compare the AIC of two models fitted on the same data, we are interested in the difference between two AICs, which will share the same −2C term, so we can safely ignore it and finally write:

 AIC = 2k + n\ln(MSE) .

This view allows us to see clearly how the AIC works (if you skipped the equations, you want to start reading again here!). A model that fits well will have a small MSE. If, to achieve that, it needs a lot of model parameters (complexity), then the term 2k will be large, making the AIC larger. The model with the smallest AIC is the model that fits the data best with the least complexity, and therefore with less chance of overfitting. The BIC is similar in construction, but imposes an even stricter penalty on the number of parameters.

The beauty of ICs is that, since the chance of overfitting is accounted for explicitly in the metric, there is no need for a validation sample. This leaves more data for model fitting. The downside is that the sample over which the IC is calculated has to remain the same. Changing the number of data points, transforming the sample or manipulating it in any way invalidates the comparison (due to the calculation of the constant C and the MSE). This is why ICs often come with the warning: use ICs only within a single model family. This is actually misleading, but often enough to make people use them correctly. For example, we can use the AIC to compare different exponential smoothing models, but not to compare exponential smoothing and ARIMA models. The reason is that there is a good chance that somewhere in the modelling the sample will be manipulated in a way that invalidates the comparison. Readers who are familiar with ARIMA models should also consider that we should not use ICs to compare ARIMA(p,0,q) and ARIMA(p,d,q) for d>0, as the differenced data have a different sample size and scale to the original observations.
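
A valid within-family comparison is easy to sketch in base R. Here two AR models are fitted to the same series with the same differencing order (d = 0), so their AICs are directly comparable; the `lh` dataset and the AR orders are just illustrative choices.

```r
# Comparing AIC within a single model family (base R arima).
# Same data, same differencing order, so the comparison is valid.
fit1 <- arima(lh, order = c(1, 0, 0))   # AR(1)
fit2 <- arima(lh, order = c(2, 0, 0))   # AR(2): one extra parameter

aics <- c(AR1 = AIC(fit1), AR2 = AIC(fit2))
names(which.min(aics))                  # prefer the model with the smaller AIC
```

The extra AR parameter in the second model lowers the in-sample error, but the 2k penalty only rewards it if the fit improves enough; the AIC makes that trade-off explicit.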

For example, let us compare two similar-looking exponential smoothing models: ETS(M,A,M) and ETS(M,Ad,M), the latter using a damped trend, which requires an additional model parameter. The model fits and forecasts are shown in Fig. 4, where we can see that they look almost identical.


Fig. 4: Forecasts using two exponential smoothing models.

The AIC values for these are:

Model         AIC
ETS(M,A,M)    767.30
ETS(M,Ad,M)   770.41

ETS(M,A,M) is preferable (minimum AIC), as the additional parameter does not offer any substantial fitting benefit. Obviously both forecasts are practically the same, so this comparison is only useful as an example. Nonetheless, given the large variety of exponential smoothing models (type of trend, type of seasonality, type of error term), using ICs can help us quickly pick the best model form without losing data to a validation set. But this comes at the cost that we can only use ICs when the data do not change (or, as it is commonly said, within a single model family). Note that the data may be changed by the model internally, without the user being aware, so when in doubt, avoid the comparison!

3. Avoid selection altogether: combine forecasts

Another approach, quite popular in research, is to avoid selecting a single forecast altogether. We can do this by combining forecasts. Returning to Fig. 1, we can take the values of both forecasts and calculate their arithmetic mean for each period:

Forecast   Jan       Feb       Mar       Apr       May       Jun
ETS        3586.245  3397.946  4204.788  3981.081  4101.305  4394.184
ARIMA      3678.148  3491.572  4233.729  3983.787  4108.809  4327.817
Combined   3632.196  3444.759  4219.259  3982.434  4105.057  4361.000

We can combine forecasts from as many sources as desired. Furthermore, there are many different ways to combine forecasts, although the simple arithmetic mean often works remarkably well. There is a lot of evidence demonstrating that forecast combination leads to very reliable and accurate forecasts. If you want to explore this further, this paper introduces several combination methods and evaluates their performance in terms of forecasting and inventory management.
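
Combination by the arithmetic mean is a one-liner once the individual forecasts are available. As before, this is a base R sketch with `HoltWinters` and `arima` standing in for the ETS and ARIMA models of the post, and an assumed ARIMA order.

```r
# Combining two forecasts with a simple arithmetic mean (base R).
y <- AirPassengers                      # stand-in monthly series
h <- 6                                  # forecast horizon

hw.fc <- predict(HoltWinters(y), n.ahead = h)            # exponential smoothing
ar.fc <- predict(arima(y, order = c(1, 1, 1),
                       seasonal = list(order = c(0, 1, 1))),
                 n.ahead = h)$pred                       # an assumed seasonal ARIMA

combined <- (hw.fc + ar.fc) / 2         # elementwise mean of the two forecasts
round(cbind(ETS = hw.fc, ARIMA = ar.fc, Combined = combined), 2)
```

With more than two sources, `rowMeans` over a matrix of forecasts does the same job; weighted means are the natural next step if some sources are known to be more accurate.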

4. Concluding remarks

Using a validation set or information criteria are two common approaches to selecting the best forecast. Each has its advantages and disadvantages. Using a validation set is general and can be used to compare forecasts from any source, but comes at a cost in sample size. On the other hand, we can use information criteria without withholding any data for a validation set, thus permitting better model specification, but they can only be used within the same model family and require the forecasts to come from a formal statistical model, so that the likelihood function can be calculated. These are not the only two ways to select forecasts; other approaches, such as meta-learning, exist, but none is perfect in every aspect.

Personally, I used to like the idea of a single ‘correct’ model. A single model also has the advantage that it is often very easy to obtain prediction intervals or use the model coefficients, for example to calculate elasticities. However, I have nowadays warmed to the idea of forecast combinations, since there is no need to rely on a single forecast. Why should we put all our trust in a single specific model, fitted to past data, to forecast the future? I would rather hedge my risk and use multiple forecasts. Of course, it is also convenient that combined forecasts are often quite accurate!