Combining forecasts has been shown in many cases to improve forecasting performance, in terms of both accuracy and bias. Combination is also common when forecasting with neural networks or other computationally intensive methods, where ensemble forecasts are generally considered more accurate than individual model forecasts. A useful feature of forecast combination is that it mitigates the uncertainty arising from model selection, model parameters and sampling.
A very common combination operator is the mean. There is strong empirical evidence in the forecasting literature that the unweighted mean performs as well as more complex combination approaches. A defining difference between forecast combinations and neural network ensembles is the number of models that can be combined. In the case of the latter, we can easily produce as many members as we want, typically either through bagging or by varying the training initialisation values. Both of these approaches aim to mitigate the uncertainty coming from the available sample and the model parameters. With a sufficiently large number of ensemble members we can estimate the distribution of the forecasts that are to be combined, something that is typically not possible when combining a limited number of models. We can then view forecast combination as the problem of estimating the location of that distribution. Following this line of thought, it becomes apparent that the mean is an appropriate combination operator only for well-behaved distributions, that is, unimodal and symmetric ones. In practice, this is a very strong requirement.
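One way to check how strong that requirement is for a given ensemble is to look at the shape of the member forecasts at a single horizon. The sketch below is only an illustration: the function name `describe_ensemble`, the synthetic data, the Gaussian kernel density estimate and the 5% prominence threshold are all assumptions of this example, not something taken from the literature discussed here.

```python
import numpy as np
from scipy.signal import find_peaks
from scipy.stats import gaussian_kde, skew


def describe_ensemble(values, grid_size=512):
    """Skewness and number of density peaks for one horizon's ensemble forecasts."""
    values = np.asarray(values, dtype=float)
    pad = values.std()  # widen the grid so peaks near the edges are not missed
    grid = np.linspace(values.min() - pad, values.max() + pad, grid_size)
    density = gaussian_kde(values)(grid)
    # Ignore ripples smaller than 5% of the tallest peak (an arbitrary threshold).
    peaks, _ = find_peaks(density, prominence=0.05 * density.max())
    return {"skewness": float(skew(values)), "n_modes": int(len(peaks))}


rng = np.random.default_rng(0)
# Synthetic ensemble (assumption): two groups of members that disagree on the level.
ensemble = np.concatenate([rng.normal(100, 2, 70), rng.normal(130, 2, 30)])
result = describe_ensemble(ensemble)
print(result)
```

A skewness far from zero or more than one density peak suggests that the ensemble is not unimodal and symmetric, in which case the mean may be a poor summary of where most members agree.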
A recent paper looked at the effect of using three fundamental combination operators: the mean, the median and the mode. They all aim to estimate the location of the distribution, but with different levels of robustness to deviations from normality. The median is less sensitive to outliers and more robust than the mean to asymmetries in the forecast distribution, while the mode is insensitive to both. In theory, one would therefore expect the mode to produce better combined forecasts, and this was shown empirically to be the case. From the same paper, the following figure summarises the results for ensembles of 10 up to 100 members, for the three basic combination operators and two datasets.
The poor performance of the mean in comparison to that of the median and the mode is evident. Furthermore, with a sufficient number of forecasts to be combined the mode performs better than the median, again as expected due to its robustness to non-normality. The mode performs poorly for small ensembles, as the sample is too small to estimate it adequately. Another interesting question is how many ensemble members are required. We can see that both the median and the mode converge quite fast in both datasets, while the mean does not, even with 100 members. More details on these results can be found in the paper. The main conclusion is that we perhaps trust the mean as a combination operator too much, and we should consider medians or modes of forecasts more, especially when a large number of forecasts is available.
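The three operators, and the question of how stable each one is for a given ensemble size, can be sketched in a few lines. This is a minimal illustration under stated assumptions: the pool of forecasts is synthetic, the mode is taken as the peak of a Gaussian kernel density estimate with SciPy's default bandwidth, and the names `kde_mode`, `combine` and `spread_by_size` are hypothetical helpers. It does not reproduce the experiment or the numbers behind the figure.

```python
import numpy as np
from scipy.stats import gaussian_kde


def kde_mode(values, grid_size=512):
    """Mode of a 1-D sample, taken as the peak of a Gaussian kernel density estimate."""
    values = np.asarray(values, dtype=float)
    if values.max() == values.min():
        return float(values[0])  # degenerate sample: the KDE is undefined
    grid = np.linspace(values.min(), values.max(), grid_size)
    return float(grid[np.argmax(gaussian_kde(values)(grid))])


OPERATORS = {"mean": np.mean, "median": np.median, "mode": kde_mode}


def combine(forecasts):
    """Combine an (n_members, horizon) array into one forecast per operator."""
    forecasts = np.asarray(forecasts, dtype=float)
    return {name: np.array([op(col) for col in forecasts.T])
            for name, op in OPERATORS.items()}


def spread_by_size(pool, sizes, n_resamples=100, seed=0):
    """Std of each combined one-step forecast across random sub-ensembles of each size."""
    rng = np.random.default_rng(seed)
    out = {name: [] for name in OPERATORS}
    for k in sizes:
        # Each sub-ensemble draws k distinct members from the pool.
        draws = [rng.choice(pool, size=k, replace=False) for _ in range(n_resamples)]
        for name, op in OPERATORS.items():
            out[name].append(float(np.std([op(d) for d in draws])))
    return out


# Synthetic pool (assumption): most members near 100, 10% in an outlying group near 160.
rng = np.random.default_rng(1)
pool = np.concatenate([rng.normal(100, 2, 450), rng.normal(160, 5, 50)])

combined = combine(pool[:, None])  # treat the pool as a single-horizon ensemble
sizes = [10, 20, 50, 100]
spread = spread_by_size(pool, sizes)
for name in OPERATORS:
    print(name, round(float(combined[name][0]), 1), [round(s, 2) for s in spread[name]])
```

The spread across resampled sub-ensembles is only a rough proxy for convergence: it shows how much a combined forecast changes with the particular members drawn, not how accurate it is. A comparison on real forecasts would also need the out-of-sample errors.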
To get a better view of the behaviour of each ensemble operator, the following demo uses the neural network forecasts for one of the time series in the aforementioned paper. The complete series, which is not shown here, contains periods of upward and downward trends. You can experiment with different ensemble sizes, up to 100 members. Apart from the combined forecasts and the out-of-sample mean absolute error, this demo tries to visualise the distribution of the forecasts for different horizons and how appropriate the mean, median and mode results are. Finally, the forecast of the “best” model according to an in-sample validation subset is provided. It is apparent that this model is over-fitted to the past, capturing only the local downward trend that dominated the validation set, which demonstrates why ensembles are useful.
Note: if the demo does not start, please click on the “Resample ensemble members” button to reload.
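The comparison the demo makes between the validation-selected “best” model and a combined forecast can be sketched as follows. All numbers here are made up purely for illustration (a validation period with a downward trend followed by a test period where the trend reverses), and the helper names `mae` and `best_member` are hypothetical. The median is used as the combination operator only to keep the example short.

```python
import numpy as np


def mae(actual, forecast):
    """Mean absolute error between two equal-length arrays."""
    return float(np.mean(np.abs(np.asarray(actual) - np.asarray(forecast))))


def best_member(val_actual, val_forecasts):
    """Index of the member with the lowest MAE on the validation subset."""
    return int(np.argmin([mae(val_actual, f) for f in val_forecasts]))


# Made-up data: the validation period trends down, the test period trends up.
val_actual = np.array([10.0, 9.0, 8.0])
test_actual = np.array([9.0, 10.0, 11.0])

# One row per ensemble member.
val_forecasts = np.array([
    [10.0, 9.0, 8.0],   # member 0 fits the validation trend exactly
    [10.5, 9.5, 9.0],
    [9.0, 9.0, 9.0],
])
test_forecasts = np.array([
    [7.0, 6.0, 5.0],    # member 0 keeps extrapolating the downward trend
    [9.0, 10.0, 10.0],
    [9.5, 10.5, 11.5],
])

best = best_member(val_actual, val_forecasts)
best_mae = mae(test_actual, test_forecasts[best])
median_mae = mae(test_actual, np.median(test_forecasts, axis=0))
print(f"best member: {best}, test MAE {best_mae:.2f}; median ensemble test MAE {median_mae:.2f}")
```

In this toy setting the member chosen on the validation subset is the one that follows the local downward trend most closely, so it does worst once the trend reverses, while the median of all members stays close to the actuals.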