The difference between in-sample fit and forecast performance

By | November 29, 2016

One of the fundamental differences between conventional model building, for example the way textbooks introduce regression modelling, and forecasting is how the in-sample fit statistics are used. In forecasting our focus is not a good description of the past, but a (hopefully) good prediction of the yet unseen values. One does not necessarily imply the other. A good fit does not necessarily lead to a good forecast, and vice versa. For example, overfit models will typically have very small in-sample errors, but be terrible at forecasting. An example of the opposite case is shrinkage, where we sacrifice in-sample fit for good generalisation of the model.

A typical premise of forecasting model building is that a good fit in the past should lead to a reasonable forecast in the future. Implicitly we assume that there is some underlying data generating process, which we hopefully approximate well enough to forecast its future values. I do not believe that we can identify the true process for real data. I have even come to doubt whether a true process exists, though this is a philosophical point. Here I will not discuss how to avoid overfitting (for a related discussion see this post). Instead I will provide a slightly unusual example that I built for my students to illustrate the difference between in-sample fitting and forecasting.

For this example, I will not use a business time series, but music. We can extract the waveform of a song and analyse it as a time series, and even try to forecast it. There is a catch though: there is an inherent causality in music. It typically follows some logical structure, harmony and so on, that a standard forecasting model is unaware of. For example, ARIMA or exponential smoothing will attempt to find structure and repeating patterns in the past and extrapolate them into the future, unaware of any linguistic or musical constructs.

Let us take a song, sample its first 10 seconds at 11,025 observations per second, and fit an adequate ARIMA. Using standard unit root testing and AICc we identify an ARIMA(5,0,4), that is an ARMA(5,4) with no differencing, as the best model. Note that the exact model used does not make much difference for this example, and ARIMA models of different orders would lead to more or less the same conclusions.
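To make the AICc step a little more concrete, here is a minimal sketch of the criterion, using the standard definition AICc = AIC + 2k(k+1)/(n-k-1). The function names are hypothetical, and the parameter count of 11 is an assumption (5 AR terms, 4 MA terms, a constant and the innovation variance); software packages differ in what they count. The sample size is the one stated above: 10 seconds at 11,025 observations per second.

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 log L."""
    return 2 * n_params - 2 * log_likelihood


def aicc_correction(n_params, n_obs):
    """Small-sample correction that AICc adds on top of AIC."""
    if n_obs - n_params - 1 <= 0:
        raise ValueError("AICc needs n_obs > n_params + 1")
    return 2 * n_params * (n_params + 1) / (n_obs - n_params - 1)


def aicc(log_likelihood, n_params, n_obs):
    """Corrected AIC: AIC plus the small-sample correction."""
    return aic(log_likelihood, n_params) + aicc_correction(n_params, n_obs)


n_obs = 10 * 11025      # 10 seconds at 11,025 observations per second
n_params = 5 + 4 + 2    # assumption: AR + MA terms, constant, innovation variance
print("AICc correction:", aicc_correction(n_params, n_obs))
```

With this many observations the correction is tiny, so under these assumptions AICc and AIC would rank candidate models almost identically. The penalty does nothing to tell us whether the model family suits the data.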

Using this model we forecast the next 5 seconds. We can assess the quality of the model fit and the forecast visually, but also as sound. Does the model fit sound anything like the original song? Does the forecast sound anything like the original song? This is a good point to warn you that I have chosen a sufficiently interesting song.

Fig. 1 provides a plot of the last 1000 historical observations and the ARMA fit, as well as the forecast for the next 500 observations, where we can easily evaluate the quality of the model fit.


Fig. 1. Plot of the last 1000 historical observations and model fit, as well as the forecast for the next 500 periods.

We can observe that the model fit is relatively good, yet the forecast is very weak. In fact, the forecast quickly reverts to the mean and the prediction intervals “contain” the whole song. This already makes it quite clear that a good fit does not mean a good forecast. The difference between the quality of the in-sample fit and the out-of-sample forecast is striking when “heard”.
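The reversion to the mean is not specific to this song: point forecasts from any stationary ARMA model converge to the series mean as the horizon grows. In-sample fitted values, by contrast, are typically one-step-ahead predictions that use the actual previous observations, which is a large part of why the fit looks so much better. The sketch below shows the simplest case, an AR(1), not the ARIMA(5,0,4) used here. The function name and the parameter values are hypothetical and are not estimated from the song.

```python
def ar1_forecasts(last_value, mean, phi, horizon):
    """Point forecasts of a stationary AR(1) for horizons 1..horizon.

    The h-step-ahead forecast is mean + phi**h * (last_value - mean),
    so the distance from the mean shrinks geometrically with h.
    """
    if not -1 < phi < 1:
        raise ValueError("a stationary AR(1) needs |phi| < 1")
    return [mean + phi ** h * (last_value - mean) for h in range(1, horizon + 1)]


# Hypothetical values chosen for illustration only.
forecasts = ar1_forecasts(last_value=0.8, mean=0.0, phi=0.9, horizon=500)

for h in (1, 10, 50, 500):
    print(f"h={h:>3}: forecast={forecasts[h - 1]:.6g}")
```

After a few dozen steps the forecast is indistinguishable from the mean. For a waveform centred on zero that is a flat line.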

The original sample:

The model fit:

The two sound samples are sufficiently similar. One can hear that the model fit is missing some of the higher pitches, which can also be seen in Fig. 1. Nonetheless, the fit sound sample arguably retains enough of the original song to be recognisable.

The forecast sounds like:

As seen in Fig. 1, the forecast very quickly becomes a flat line, which results in silence.

It is clear that although the model fit captures the song adequately, the forecast is useless. Why is this the case? The modelling methodology I followed here is geared towards good in-sample fit, rather than forecasting. Arguably the model family I used is also inappropriate. However, neither of these becomes apparent by consulting only the in-sample fit.

Generalising, consulting conventional in-sample statistics such as R², mean squared error, etc. is not only useless, but also dangerous, when the purpose of model building is forecasting. Consulting penalised statistics, such as information criteria (AIC, BIC, etc.), is useful, but again care should be taken. In our example the best model was chosen using AICc and it still did not produce any useful forecast. Business forecasting is not that dissimilar from the example here.
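A small synthetic illustration of why in-sample error alone can mislead. It compares a straight line with a polynomial that passes through every training point (a deliberately overfit model) on a made-up series: a linear trend plus an alternating disturbance. Neither the data nor the models come from the song example, and all names are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares straight line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return lambda x: intercept + slope * x


def fit_interpolant(xs, ys):
    """Lagrange polynomial through every training point (overfit on purpose)."""
    def predict(x):
        total = 0.0
        for j, (xj, yj) in enumerate(zip(xs, ys)):
            weight = 1.0
            for m, xm in enumerate(xs):
                if m != j:
                    weight *= (x - xm) / (xj - xm)
            total += yj * weight
        return total
    return predict


def mse(model, xs, ys):
    """Mean squared error of a fitted model on the given points."""
    return sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Synthetic series: trend 2t plus a disturbance alternating between +1 and -1.
times = list(range(13))
values = [2 * t + (1 if t % 2 == 0 else -1) for t in times]
train_t, train_y = times[:10], values[:10]   # fitting sample
test_t, test_y = times[10:], values[10:]     # held-out "future"

results = {}
for name, fit in (("line", fit_line), ("interpolant", fit_interpolant)):
    model = fit(train_t, train_y)
    results[name] = (mse(model, train_t, train_y), mse(model, test_t, test_y))
    print(name, "in-sample MSE:", results[name][0],
          "out-of-sample MSE:", results[name][1])
```

The interpolant wins on in-sample error by construction, since it reproduces the training points, and loses badly on the held-out observations. Only the out-of-sample comparison reveals which of the two is usable for forecasting.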

The reason that forecasting fails here is that no attempt was made during model fitting to identify the key drivers of the time series, in this case the musical and linguistic structure (though there are no lyrics in the sample used here). Instead, a generic extrapolative model was used. Business series often have similar complexities, such as promotions, price changes, events, and so on. These should be part of the model building exercise. Simply flagging observations as outlying, or merely identifying and fitting the “best” model from a limited pool of alternatives, is a dangerous strategy. What I often argue to my students is that a good modeller should appreciate the problem context and use it! In-sample statistics are woefully inadequate for this, especially when forecasting is the objective.

A final note: all rights for the song are with the original holders. I am in no way a composer myself and the song is not my work! Can anyone recognise which song it is?

8 thoughts on “The difference between in-sample fit and forecast performance”

  1. Pingback: The difference between in-sample fit and forecast performance – Forecasters blog

  2. Cristofer

    Behemoth! Excellent!

    BTW, I’m studying intermittent demand and reached your blog through your published articles. Amazing work.

  3. Rahul

    Great post. It touches precisely on the topic that most articles conveniently skip or overlook.

    However, I was hoping you could shed some light on this: should one set aside a certain part of the sample (test set/validation set) for forecasting and validating the model? If the answer is yes, then let's say I am building a Triple Exponential Smoothing model, and arrive at the final alpha, beta and gamma values by minimising RMSE (MAPE, etc.) on the test set. The part that I am slightly apprehensive about is this: once I have found the best alpha, beta and gamma values, should I apply these final values to the entire set (train + test) AGAIN, and then forecast the out-of-sample/future values right after the point where the test set ends? Or would you do something different?

    Would appreciate any direction in the matter.

    1. Nikos Post author

      Thanks Rahul! This is a good question. Indeed during model building it is helpful to separate your available sample into appropriate subsets, for example training, validation and test. You would optimise your parameters in the training set as normal and then evaluate your forecasts on the validation/test sets, depending on your experimental setup. Once you have decided on the best model, for example the Triple Exponential Smoothing you mention, you will need to re-optimise its parameters across the complete sample. This way you will have more observations to get better estimates and also take advantage of the most recent information in the time series. That final re-optimised model is what should be used for forecasting.
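The workflow described in this reply can be sketched as follows. This is only an illustration under stated assumptions: it uses simple exponential smoothing with a single parameter alpha instead of the Triple Exponential Smoothing discussed above, a plain grid search instead of a proper optimiser, a single train/test split, and a made-up series. All function names are hypothetical.

```python
def ses_one_step_forecasts(series, alpha):
    """One-step-ahead SES forecasts; element t predicts series[t] from earlier values."""
    level = series[0]            # simple initialisation at the first observation
    forecasts = [level]
    for y in series[:-1]:
        level = alpha * y + (1 - alpha) * level
        forecasts.append(level)
    return forecasts


def mse(actual, predicted):
    """Mean squared error between two equally long sequences."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def optimise_alpha(series, grid):
    """Grid value of alpha with the lowest in-sample one-step-ahead MSE."""
    def in_sample_mse(alpha):
        return mse(series[1:], ses_one_step_forecasts(series, alpha)[1:])
    return min(grid, key=in_sample_mse)


def ses_forecast(series, alpha, horizon):
    """SES forecast: the final level, repeated for every horizon."""
    level = series[0]
    for y in series:
        level = alpha * y + (1 - alpha) * level
    return [level] * horizon


# Made-up series for illustration only.
series = [12, 15, 14, 16, 19, 18, 21, 20, 23, 25, 24, 27]
grid = [i / 20 for i in range(1, 20)]        # alpha = 0.05, 0.10, ..., 0.95
train, test = series[:9], series[9:]

# 1. Optimise the parameter on the training set only.
alpha_train = optimise_alpha(train, grid)
# 2. Evaluate genuine forecasts on the held-out test set.
test_mse = mse(test, ses_forecast(train, alpha_train, len(test)))
# 3. Having settled on the model, re-optimise on the complete sample and forecast.
alpha_final = optimise_alpha(series, grid)
final_forecast = ses_forecast(series, alpha_final, horizon=3)

print("alpha (train):", alpha_train, "test MSE:", test_mse)
print("alpha (full sample):", alpha_final, "forecast:", final_forecast)
```

In practice step 2 would be repeated for every candidate model, and the one with the best test-set performance would go through step 3.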

