M. Hibon, S. F. Crone and N. Kourentzes, 2012, The 32nd Annual International Symposium on Forecasting, Boston.
The series of M-competitions (1982, 1993, 2000) has firmly established forecasting competitions as an objective approach to assessing the empirical ex ante accuracy of competing forecasting methods. While the original M-competitions merely ranked algorithms by performance, Koning et al. (2005) assessed the results of M3 for the significance of the error differences using two non-parametric statistical tests: multiple comparisons with the mean (ANOM) and multiple comparisons with the best (MCB). As a result, they identified only a few methods that significantly outperformed all others.
However, MCB and ANOM have been criticised for their sensitivity to the sample size of methods assessed and for their limited interpretability, as both provide only a binary classification of whether a model is significantly “better” or “worse” than the average (ANOM) or the best (MCB) performance of all models. This provides no information on how individual models differ within each of these classes, or on whether such differences are significant. To overcome these limitations, the Friedman and Nemenyi tests have been proposed (Demsar, 2006), which are frequently employed in model comparisons in data mining and machine learning.
We empirically assess (a) the robustness of ANOM and MCB in comparison to the Friedman and Nemenyi test, and (b) the interpretability of the resulting significantly different subgroups. As a dataset, this study revisits the results of the recent NN3 competition by Crone et al. (2011), which extended the M3 competition to 59 new algorithms of computational intelligence in comparison to a subset of the 5 best models originally submitted to M3, evaluated on two masked subsets of 111 and 11 empirical time series of monthly M3 industry data. We assess the robustness of the statistical tests with respect to (1) the error measure used to derive model ranks, for which we provide suggestions, (2) the number of included models, by including all contenders of the M3, and (3) the number of time series compared. As a result, we derive the robustness of each test for performance evaluations in competitions as well as in empirical simulation studies of forecasting.
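For readers unfamiliar with the rank-based procedure referred to above, the following is a minimal sketch of a Friedman statistic and a Nemenyi critical distance, written in plain Python. It is an illustration under stated assumptions, not the implementation used in the study: the function names, the `errors` matrix and its values are hypothetical, the critical values are the alpha = 0.05 entries commonly tabulated for the Nemenyi test (as in Demsar, 2006), and the Friedman statistic is computed without a correction for ties.

```python
from math import sqrt

# Assumed critical values q_alpha for the two-tailed Nemenyi test at alpha = 0.05
# (Studentized range statistic divided by sqrt(2)), indexed by the number of methods k.
Q_ALPHA_005 = {2: 1.960, 3: 2.343, 4: 2.569, 5: 2.728, 6: 2.850,
               7: 2.949, 8: 3.031, 9: 3.102, 10: 3.164}


def average_ranks(row):
    """Rank the errors of one series (1 = lowest error); ties share the average rank."""
    order = sorted(range(len(row)), key=lambda j: row[j])
    ranks = [0.0] * len(row)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of methods tied with the one at position i.
        while j + 1 < len(order) and row[order[j + 1]] == row[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for pos in range(i, j + 1):
            ranks[order[pos]] = shared
        i = j + 1
    return ranks


def friedman_nemenyi(errors):
    """errors[i][j] is the error of method j on series i.

    Returns (mean ranks per method, Friedman chi-square statistic without
    tie correction, Nemenyi critical distance at alpha = 0.05).
    """
    n, k = len(errors), len(errors[0])
    rank_rows = [average_ranks(row) for row in errors]
    mean_ranks = [sum(r[j] for r in rank_rows) / n for j in range(k)]
    chi2 = 12 * n / (k * (k + 1)) * (
        sum(r * r for r in mean_ranks) - k * (k + 1) ** 2 / 4
    )
    cd = Q_ALPHA_005[k] * sqrt(k * (k + 1) / (6 * n))
    return mean_ranks, chi2, cd


# Hypothetical error values for 3 methods on 6 series (illustrative only).
errors = [
    [0.10, 0.20, 0.30],
    [0.12, 0.18, 0.25],
    [0.08, 0.15, 0.40],
    [0.20, 0.22, 0.35],
    [0.11, 0.19, 0.21],
    [0.09, 0.30, 0.31],
]
mean_ranks, chi2, cd = friedman_nemenyi(errors)

# Two methods differ significantly if their mean ranks differ by more than cd.
significant_pairs = [
    (a, b)
    for a in range(len(mean_ranks))
    for b in range(a + 1, len(mean_ranks))
    if abs(mean_ranks[a] - mean_ranks[b]) > cd
]
```

Unlike a binary "better or worse than the best" classification, the pairwise comparison of mean ranks against the critical distance indicates which individual methods can and cannot be distinguished from one another. The Friedman statistic would be compared against a chi-square distribution with k - 1 degrees of freedom before interpreting the pairwise differences.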