Benchmarking Facebook’s Prophet

By | July 29, 2017

Last February Facebook open sourced its Prophet forecasting tool. Since then, it has appeared in quite a few online discussions. A good thing about Prophet is that one can use it very easily through R (and Python). This gave me the opportunity to benchmark it against some more standard – and some less standard! – forecasting models and methods. To do this I tried it on the M3 competition dataset (available through the Mcomp package for R).

I should start by saying that the development team of Prophet suggests that its strengths are:

  • high-frequency data (hourly, daily, or weekly) with multiple seasonalities, such as day of week and time of year;
  • special events and bank holidays that are not fixed in the year;
  • missing values or large outliers;
  • changes in the historical trends, which themselves follow non-linear growth curves.

The M3 dataset has multiple series of micro/business interest and, as a recent presentation by E. Spiliotis et al. at ISF2017 (slides 11-12) indicated, the characteristics of its time series overlap with those of typical business time series, albeit not at high frequency. However, a lot of business forecasting is still not done at hourly or daily frequency, so the lack of high-frequency examples is not necessarily an issue for many business forecasters when benchmarking Prophet.

The setup of the experiment is:

  • Use Mean Absolute Scaled Error (MASE). I chose this measure as it has good statistical properties and has become quite common in forecasting research.
  • Use rolling origin evaluation, so as to ensure that the reported figures are robust against particularly lucky (or unlucky) forecast origins and test sets.
  • Use the forecast horizons and test sets indicated in Table 1, for each M3 subset.
Table 1. M3 dataset

| Set       | No. of series | Horizon | Test set |
|-----------|---------------|---------|----------|
| Yearly    | 645           | 4       | 8        |
| Quarterly | 756           | 4       | 8        |
| Monthly   | 1428          | 12      | 18       |
| Other     | 174           | 12      | 18       |
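To make the setup concrete, here is a pure-Python sketch of MASE with rolling-origin evaluation. This is only an illustration of the scheme described above, not the actual evaluation code (which was run in R), and all function and variable names are my own. The default forecaster is a naive one, standing in for ETS, ARIMA, Prophet, and so on.

```python
def mase(train, actual, forecast, m=1):
    """Mean Absolute Scaled Error: out-of-sample MAE scaled by the
    in-sample MAE of the (seasonal) naive method with period m."""
    scale = sum(abs(train[t] - train[t - m])
                for t in range(m, len(train))) / (len(train) - m)
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual) / scale

def rolling_origin_mase(series, horizon, test_len, m=1, forecaster=None):
    """Average MASE over all origins: hold out `test_len` points, then roll
    the forecast origin forward while `horizon` points remain in the test set."""
    if forecaster is None:
        # default stand-in method: naive (last observation carried forward)
        forecaster = lambda train, h: [train[-1]] * h
    start = len(series) - test_len
    scores = []
    for origin in range(start, len(series) - horizon + 1):
        train = series[:origin]
        future = series[origin:origin + horizon]
        scores.append(mase(train, future, forecaster(train, horizon), m))
    return sum(scores) / len(scores)
```

For the monthly subset, for example, this would be called with horizon 12 and a test set of 18 observations, averaging MASE over seven origins per series.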

I used a number of benchmarks from some existing packages in R, namely:

  • forecast package, from which I used the exponential smoothing (ets) and ARIMA (auto.arima) functions. Anybody doing forecasting in R is familiar with this package! ETS and ARIMA over the years have been shown to be very strong benchmarks for business forecasting tasks and specifically for the M3 dataset.
  • smooth package. This is a less well-known package that offers alternative implementations of exponential smoothing (es) and ARIMA (auto.ssarima), which follow a different modelling philosophy than the forecast package equivalents. If you are interested, head over to Ivan’s blog to read more about these (and other nice blog posts). The forecast and smooth packages used together offer tremendous flexibility in ETS and ARIMA modelling.
  • MAPA and thief packages, which both implement Multiple Temporal Aggregation (MTA) for forecasting, following two alternative approaches that I detail here (for MAPA) and here (for THieF). I included these as they have been shown to perform quite well on such tasks.

The idea here is to give Prophet a hard time, but also avoid using too exotic forecasting methods.

I provide the mean and median MASE across all forecast origins and series for each subset in Tables 2 and 3 respectively. In brackets I provide the percentage difference from the accuracy of ETS. In boldface I have highlighted the best forecast for each M3 subset. Prophet results are in blue. I provide two MAPA results: the first uses the default options, whereas the second uses comb="w.mean", which is more mindful of seasonality. For THieF I only provide the default result (using ETS), as in principle it could be applied to any forecast in the table.

Table 2. Mean MASE results

| Set | ETS | ARIMA | ES (smooth) | SSARIMA (smooth) | MAPA | MAPA (w.mean) | THieF (ETS) | Prophet |
|---|---|---|---|---|---|---|---|---|
| Yearly | 0.732 (0.00%) | 0.746 (-1.91%) | 0.777 (-6.15%) | 0.783 (-6.97%) | 0.732 (0.00%) | 0.732 (0.00%) | 0.732 (0.00%) | 0.954 (-30.33%) |
| Quarterly | 0.383 (0.00%) | 0.389 (-1.57%) | 0.385 (-0.52%) | 0.412 (-7.57%) | 0.386 (-0.78%) | 0.384 (-0.26%) | 0.400 (-4.44%) | 0.553 (-44.39%) |
| Monthly | 0.464 (0.00%) | 0.472 (-1.72%) | 0.465 (-0.22%) | 0.490 (-5.60%) | 0.459 (+1.08%) | 0.458 (+1.29%) | 0.462 (+0.43%) | 0.586 (-26.29%) |
| Other | 0.447 (0.00%) | 0.460 (-2.91%) | 0.446 (+0.22%) | 0.457 (-2.24%) | 0.444 (+0.67%) | 0.444 (+0.67%) | 0.447 (0.00%) | 0.554 (-23.94%) |

Table 3. Median MASE results

| Set | ETS | ARIMA | ES (smooth) | SSARIMA (smooth) | MAPA | MAPA (w.mean) | THieF (ETS) | Prophet |
|---|---|---|---|---|---|---|---|---|
| Yearly | 0.514 (0.00%) | 0.519 (-0.97%) | 0.511 (+0.58%) | 0.524 (-1.95%) | 0.520 (-1.17%) | 0.520 (-1.17%) | 0.514 (0.00%) | 0.710 (-38.13%) |
| Quarterly | 0.269 (0.00%) | 0.266 (+1.12%) | 0.256 (+4.83%) | 0.278 (-3.35%) | 0.254 (+5.58%) | 0.254 (+5.58%) | 0.262 (+2.60%) | 0.388 (-44.24%) |
| Monthly | 0.353 (0.00%) | 0.348 (+1.42%) | 0.351 (+0.57%) | 0.373 (-5.67%) | 0.352 (+0.28%) | 0.351 (+0.57%) | 0.351 (+0.57%) | 0.473 (-33.99%) |
| Other | 0.275 (0.00%) | 0.269 (+2.18%) | 0.270 (+1.82%) | 0.268 (+2.55%) | 0.283 (-2.91%) | 0.283 (-2.91%) | 0.275 (0.00%) | 0.320 (-16.36%) |
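For reference, the bracketed percentages are simply the relative difference from the ETS column, with a positive sign meaning the method beats ETS. A quick sketch (function name is my own), using figures from the Yearly row of Table 2:

```python
def pct_vs_ets(ets_mase, method_mase):
    """Percentage difference from ETS; positive = more accurate than ETS."""
    return 100 * (ets_mase - method_mase) / ets_mase

print(round(pct_vs_ets(0.732, 0.746), 2))  # ARIMA, Yearly: -1.91
print(round(pct_vs_ets(0.732, 0.954), 2))  # Prophet, Yearly: -30.33
```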

Some comments about the results:

  • Prophet performs very poorly. The dataset does not contain multiple seasonalities, but it does contain human-activity-based seasonal patterns (quarterly and monthly series), changing trends, and outliers or other abrupt changes (especially in the ‘other’ subset), where Prophet should do well. My concern is not that it does not rank first, but that at best it is almost 16% worse than exponential smoothing (and at worst almost 44%!);
  • ETS and ARIMA perform reasonably similarly across packages, indicating that although there are implementation differences, both packages follow sound modelling philosophies;
  • MAPA and THieF are meant to work on the quarterly and monthly subsets, where, in line with the research, they improve upon their base model (ETS).

In all fairness, more testing is needed on high-frequency data with multiple seasonalities before one can draw conclusions about the performance of Prophet. Nonetheless, for the vast majority of business forecasting needs (such as supply chain forecasting), Prophet does not seem to perform that well. As a final note, this is an open source project, so I expect to see interesting improvements over time.

Finally, I want to thank Oliver Schaer for providing me with Prophet R code examples! You can also find some examples here.

6 thoughts on “Benchmarking Facebook’s Prophet”

  1. Prasanna

    This is in line with what we observed in our testing with the M3 dataset and a custom method. Could this have to do with them fitting an additive model (predicting trend, seasonality, etc. individually) and not being able to strip these elements without overlap? Or are we supposed to fine-tune the models a lot more? Can we be fairly confident that their way of just generating Fourier terms for seasonality is probably not the best way to do it?

    Reply
    1. Nikos Post author

      That fits my understanding as well. I would expect that some appropriate Box-Cox transform prior to generating the Prophet forecasts would help. However, there is a secondary issue. These time series are relatively short, in terms of how many times we observe the cycles (and the various harmonics), which can affect the resulting Fourier terms. I think it is important to stress that this evaluation does not discredit the Prophet forecasts, or its building blocks, but rather casts doubt on the fully automatic methodology it currently uses when forecasting low-frequency data.

      An interesting analogy is the TBATS model available in the forecast package, which uses a trigonometric representation of seasonality (after a Box-Cox transformation). In my experience it does not perform great when compared to ETS or ARIMA on low-frequency data, as for these data the structure is straightforward enough for either ETS or ARIMA to do a good job. On the other hand, for high-frequency time series the number of parameters can make either ETS or ARIMA much harder to use, and TBATS starts being an attractive alternative, especially in the presence of multiple cycles. I would expect to see a relative increase in the performance of Prophet compared to ETS or ARIMA on high-frequency data. This is certainly an interesting research area!
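For readers unfamiliar with the trigonometric representation discussed here: both Prophet and TBATS build their seasonal components from pairs of Fourier terms used as regressors. A minimal sketch of generating such terms (my own naming, not the internals of either package):

```python
import math

def fourier_terms(n, m, K):
    """For a season of length m, return n rows of regressors
    [sin(2*pi*k*t/m), cos(2*pi*k*t/m)] for harmonics k = 1..K."""
    return [[f(2 * math.pi * k * t / m)
             for k in range(1, K + 1)
             for f in (math.sin, math.cos)]
            for t in range(n)]

# e.g. one year of monthly data (m = 12) with two harmonics -> 4 columns
X = fourier_terms(12, 12, 2)
```

With short, low-frequency series the cycles are observed only a few times, so estimating the coefficients on these regressors is harder, which is consistent with the point about short series above.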

      Reply
  2. AnscombesGimlet

    Are there any standard daily datasets for testing forecast accuracy? Daily seems more difficult to get good forecasts for than higher level aggregates. I’d be very interested in seeing daily forecast results for Prophet vs others as it is my understanding it was built primarily for daily data. Also, do you believe it would be a good model to include in a forecast ensemble since it doesn’t forecast in the same way as other algorithms?

    Reply
    1. Nikos Post author

      Unfortunately I am not aware of any standard daily dataset, though I intend to source some data and repeat the comparison on daily data, where Prophet is supposed to shine. Although my evaluation does not indicate that it forecasts well for low-frequency data, I am sure that a lot of thinking has gone into it and there will be datasets and time series on which it performs quite well.
      With regards to the ensemble, intuitively you are quite right, but when a very poor forecast is included in the combination it may well harm the performance of the combined forecast. I have a paper under review that explores this aspect and demonstrates that putting some thought into the pool of models to combine can get you a long way in terms of accuracy improvement, while also simplifying your ensemble. I hope to be able to provide some more info on this soon, when the review of the paper has progressed!

      Reply
      1. AnscombesGimlet

        Very interesting. I have often wondered why no papers exist (that I’ve seen anyway) on how to choose _which_ models to include in an ensemble. The other issue being _how many_ models to include. My current methodology on this is to generate a set of models from unique algorithms, generate all possible combinations with a given ensemble size, then test every combination on multiple test sets. Then average my out-of-sample error measures for each combination and each test set to determine which models to include. For instance, if I build 30 models, and want to test 5-model ensembles, this would generate 142,506 combinations. As you can imagine, if you boost up the initial set of models to 50+, or choose a larger ensemble size (up to a point), it becomes very costly to test every combination due to memory constraints. Kind of a brute-force approach and probably has issues I haven’t thought of. I haven’t heard of others trying this though, any thoughts on this approach?
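The combination counts in the comment above are straightforward to verify, and show how quickly the brute-force search grows with the candidate pool:

```python
import math

# Number of 5-model ensembles from a pool of 30 candidate models,
# and the blow-up when the pool grows to 50+ models.
print(math.comb(30, 5))  # 142506
print(math.comb(50, 5))  # 2118760
```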

        Interesting paper, I look forward to it! I really wish we could find a way to improve the methods of combinations. It seems median is arguably the best performing, but I’d love to see more research in this area.

        Lastly – do you have any thoughts on bagging ensembles? Would this be an appropriate way to squeeze out extra forecast accuracy? It certainly would come at a massive time cost, but sometimes that is worth it.

        Reply
        1. Nikos Post author

          Yes, the issue would be that it quickly gets very expensive to run through all the combinations, but as you rightly say, both “which” and “how many” questions are important and arguably unresolved.
          My thinking (which is reflected in that paper, wonder what the reviewers will say when I get comments back) is that we should first pre-filter the members of the ensemble and then bother with the combination weights/approach. So it is a hybrid between model selection/combination that is done sequentially.
          As to your last point, I have found that using the mode of an ensemble prediction, as obtained via kernel density estimation, performs better than the median, but requires more members. Here is the reference:

          Reply
