The tsutils package for R includes functions that help with time series exploration and forecasting, that were previously included in the TStools package that is only available on github. The name change was necessary as there is another package on CRAN with the same name. The objective of TStools is to provide a development and testing ground for various functions developed by my group that has led to the creation of various packages over time, such as smooth, nnfor and now tsutils.
The package provides functions to support the following tasks:
- tests and visualisations that can help the modeller explore time series components and perform decomposition;
- modelling shortcuts, such as functions to construct lagmatrices and seasonal dummy variables of various forms;
- an implementation of the Theta method;
- tools to facilitate the design of the forecasting process, such as ABC-XYZ analyses;
- “quality of life” tools, such as treating time series for trailing and leading values.
We will look at some of these functions here, to show you how tsutils can be helpful in your forecasting projects.
Time series exploration
Exploring your time series is an important step in modelling them. Some standard questions that one has to consider is whether a time series exhibits trend and/or seasonality. First, we deal with the existence of trend. To look into this, it is useful to extranct the Centred Moving Average, and to do this you can use the functon cmav()
.
library(tsutils)
cma <- cmav(referrals) # referrals in a series included in the tsutils package.
The cmav()
function offers a few useful arguments. By default it takes the length of the moving average from the time series information and sets it equal to its frequency. The user can override this by setting the argument ma
. Another useful argument is fill=c(TRUE,FALSE)
that instructs the function to fill the first and last observations of the centred moving average that are ommited in the calculation, due to the “centering”. This is done using exponential smoothing (ets()
from the forecast package). This is done by default. The argument fast
restricts the number of exponential smoothing models considered to speed up the calculation. Finallly the argument outplot=c(FALSE,TRUE)
provides a plot of the time sereis and the calculated centred moving average.The CMA can help us identify whether there is a trend visually (see this paper for a discussion), but also to help us with statistical testing.
The function trendtest()
provides a logical (TRUE/FALSE) answer for the presence of trend in a time series. It offers two alternatives, controlled by the argument type
, which can take the values "aicc"
or "cs"
that tests by either fitting alternative exponential smoothing models and comparing their AICc or by using the Cox-Stuart test at 5% significance (you can run the test on its own using the function coxstuart()
). The test can be conducted on the raw time series, or its CMA, the latter providing a cleaner signal of the low frequency components of the series, a major component of which is the trend. This is controlled by the argument extract
. When this is TRUE
the test is conducted on the CMA.
# Test on the CMA
trendtest(referrals, extract=TRUE)
## [1] TRUE
# Test on the raw time series
trendtest(referrals, extract=FALSE)
## [1] FALSE
We can see that testing on the unfiltered data provides a different response. A further (experimental) argument is mta
that instructs the function to use Multiple Temporal Aggregation, as proposed in the paper that introduces MAPA. Some evidence of the effect of this is provided here, where it increases the accuracy of both approaches, AICc or Cox-Stuart testing.
The function seasplot()
is handy for exploring the seasonality of the time series. This provides various arguments for controlling the result. You can manually set the seasonal length (ma
– otherwise it uses frequency(y)
), the starting period of the seasonal cycle (s
– also picked up from the time series object, if available), assume trend being present, absent or test (trend
), type of decomposition when trend is present (decomposition
) and using precaclculated CMA for the decomposition (cma
– use NULL
for calculating internally). If there is trend in the time series (as set by the user, or tested) then the time series is first de-trended according to the decomposition
setting and then the seasonal plot is produced.
There are a number of arguments to control the visualisation of seasonality. The argument outplot
takes the following options:
- 0, no plot;
- 1, seasonal diagram;
- 2, seasonal boxplots;
- 3, seasonal subseries;
- 4, seasonal distribution;
- 5, seasonal density.
In addition, one can provide various arguments to the plot function (consider in the ...
input), override the colouring (colour
) and the labels (labels
).
seasplot(referrals)## Results of statistical testing
## Evidence of trend: TRUE (pval: 0)
## Evidence of seasonality: TRUE (pval: 0.002)
The figure below exemplifies the available ploting alternatives.The function provides some useful results, organised in a list: the trend and seasonal compoonents, the logical result of testing for their presence and the p-value of the tests. The trend is tested using the coxstuart()
function, while the seasonality is tested by applying the Friedman test on the data as organised for `outplot=2’, i.e. is there evidence that at least one of the ranks of the distributions of the seasonal indices is significantly different from the rest?
Finally, the function decomp()
provides classical time series decomposition. Many of the arguments are similar to the seasplot
function. Note that type
now defines the method that is used to calculate the seasonal indices, which can be the arithmetic mean, median or a “pure seasonal” exponential smoothing model. That is, simply applying an exponential smoothing on the seasonal indices to smooth any randomness. As the later makes it complicated to extrapolate the seasonal component, the function accepts the argument h
to provide predicted seasonal indices for h-steps ahead. Additionally for the calculation of the arithmetic mean one can winsorise extremes using the argument w
.
dc <- decomp(referrals,outplot=TRUE)
names(dc)
## [1] "trend" "season" "irregular" "f.season" "g"
You can see that the output of the function is a list containing the various time series components. Note that decomp()
does not test for the presence of seasonality will always extract a seasonal component.
Time series modelling
There are a few fuctions that are designed with regression modelling in mind. The function lagmatrix()
takes an input variable and constructs leads and lags as required.
lags <- lagmatrix(referrals, c(0,1,3:5))
head(lags,10)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 933976 NA NA NA NA
## [2,] 890852 933976 NA NA NA
## [3,] 912027 890852 NA NA NA
## [4,] 955852 912027 933976 NA NA
## [5,] 821916 955852 890852 933976 NA
## [6,] 928176 821916 912027 890852 933976
## [7,] 979219 928176 955852 912027 890852
## [8,] 853812 979219 821916 955852 912027
## [9,] 792731 853812 928176 821916 955852
## [10,] 913787 792731 979219 928176 821916
The first column is the original variable (lag 0), while the other columns are the requested lagged realisations. Leads are constructed by using negative numbers for the argument lag
.
The function seasdummy()
creates seasonal dummy variables that can be either binary or trigonometric, as defined by the argument type=c("bin","trg")
. Other arguments are the number of observations to create (n
), the seasonal periodicity (m
), a time series (y
) that is an optional argument and can be useful to get the seasonal periodicity and the starting period information. The latter is useful when the seasonality is trigonometric to get the phase of the sines and cosines correctly. Argument full
retains or drops the m-th dummy variable. For linear models that would typically be set to full=FALSE
. For nonlinear models this becomes a somewhat more interesting modelling question. Some evidence for neural networks is provided in this paper.
seasdummy(n=10,m=12)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11]
## [1,] 1 0 0 0 0 0 0 0 0 0 0
## [2,] 0 1 0 0 0 0 0 0 0 0 0
## [3,] 0 0 1 0 0 0 0 0 0 0 0
## [4,] 0 0 0 1 0 0 0 0 0 0 0
## [5,] 0 0 0 0 1 0 0 0 0 0 0
## [6,] 0 0 0 0 0 1 0 0 0 0 0
## [7,] 0 0 0 0 0 0 1 0 0 0 0
## [8,] 0 0 0 0 0 0 0 1 0 0 0
## [9,] 0 0 0 0 0 0 0 0 1 0 0
## [10,] 0 0 0 0 0 0 0 0 0 1 0
For example, one can use these two functions to quick construct dynamic regression models.
Xl <- lagmatrix(referrals,c(0:2))
Xd <- seasdummy(n=length(referrals),y=referrals)
X <- as.data.frame(cbind(Xl,Xd))
colnames(X) <- c("y",paste0("lag",1:2),paste0("d",1:11))
fit <- lm(y~.,X)
summary(fit)
##
## Call:
## lm(formula = y ~ ., data = X)
##
## Residuals:
## Min 1Q Median 3Q Max
## -99439 -43116 7114 38919 98649
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.808e+05 1.695e+05 2.836 0.00672 **
## lag1 1.469e-01 1.337e-01 1.098 0.27760
## lag2 1.799e-01 1.397e-01 1.288 0.20394
## d1 1.751e+05 3.820e+04 4.584 3.38e-05 ***
## d2 1.867e+05 4.161e+04 4.488 4.64e-05 ***
## d3 2.445e+05 3.485e+04 7.015 7.75e-09 ***
## d4 1.285e+05 4.204e+04 3.056 0.00369 **
## d5 1.840e+05 3.892e+04 4.727 2.10e-05 ***
## d6 2.020e+05 3.709e+04 5.447 1.83e-06 ***
## d7 2.054e+05 3.621e+04 5.673 8.39e-07 ***
## d8 1.139e+05 3.613e+04 3.152 0.00282 **
## d9 1.698e+05 3.604e+04 4.713 2.21e-05 ***
## d10 2.168e+05 3.749e+04 5.783 5.73e-07 ***
## d11 1.616e+05 3.658e+04 4.418 5.83e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 56250 on 47 degrees of freedom
## (2 observations deleted due to missingness)
## Multiple R-squared: 0.5882, Adjusted R-squared: 0.4743
## F-statistic: 5.165 on 13 and 47 DF, p-value: 1.365e-05
Obviously, one might prefer to have the seasonal dummies as a factor variable, when these are to be coded as binary dummy variables. Users of the nnfor package may recognise these two functions. In the next release of that package, both functions will be moved to the tsutils (this is already done in the development version).
The function residout()
looks at model residuals from a control chart point of view and identifies outlying ones. You can manually set the threshold over which the standardised residuals will be considered outlier (t
) and whether to produce a plot (outplot
).
rsd <- residout(residuals(fit))
If there are outliers, their location and values are provided in the outputted list.
err <- rnorm(20)
err[5] <- err[5]+5
residout(err,outplot=FALSE)$location
## [1] 5
The idea behind this function is to identify outliers so that the modeller can construct dummies to account for them or manipulate them otherwise.
A somewhat more exotic function is getOptK
that provides the optimal temporal aggregation level for AR(1), MA(1) and ARMA(1,1) processes. The user defines the type of data generating process that we expect (type=c("ar","ma","arma")
) and the maximum aggregation level (m
). The output is the theoretically optimal aggregation level.
getOptK(referrals) # This time series is clearly not an AR(1)!
## [1] 9
For a discussion of the pros and cons of this theoretical calculation see this paper. My personal view is that Multiple Temporal Aggregation as implemented in MAPA or Temporal Hierarchies (thief) is a less risky modelling approach and therefore more desirable for practice.
The function theta()
provides an implementation of the classical Theta method, with a few modifications. The workflow matches the one in the forecast package:
fit <- theta(referrals)
frc <- forecast.theta(fit,h=20)
plot(frc)
This implementation of the Theta method tests for seasonality and trend. Seasonal decomposition can be done either additively or multiplicatively and the seasonality is treated as a pure seasonal model. The various Theta components can be optimised using different cost functions. The originally proposed Theta method always assumed multiplicative seasonality and presence of trend, while all theta lines were optimised using MSE and seasonality was estimated using classical decomposition and arithmetic mean. The various arguments of the function can control how theta()
operates: m
can be used to manually set the seasonal length, sign.level
the significance level for the statistical tests, cost0
to cost2
the cost function for the various components of the model, multiplicative
the type of decomposition used and cma
an optional pre-calculated Centred Moving Average used for decomposition and testing. Also note that since impolementation is not based on a model, it does not provide prediction intervals or information criteria.
If you are a user of thief (if not give it a try!) the function theta.thief()
integrates both packages:
library(thief)
thief(referrals,forecastfunction=theta.thief)
## Jan Feb Mar Apr May Jun Jul
## 2013 1035418.8
## 2014 955320.7 957015.3 1067922.5 984519.0 1019637.9 1027866.9 1040235.2
## 2015 959464.4 961167.2 1072865.8 988823.4 1024207.9 1032495.2
## Aug Sep Oct Nov Dec
## 2013 943204.9 991756.3 1031436.6 987854.5 819194.1
## 2014 947299.6 996207.8 1036200.5 992380.8 822408.6
## 2015
In this case temporal hierarchy forecasts are constructed using the Theta method, as implmemented in tsutils, at all temporal aggregation levels.
One of the major tasks in producing the appropriate forecasts for a time series is evaluating the performance of different models/methods. The nemenyi()
function is helpful to perform nonparametric multiple comparisons testing. There are two considerations. First, some difference in summary performance statistics does not always mean that one forecast is indeed preferable, as differences may still be due to randomeness, and more evidence is necessary. Second, doing repeated pairwise testing between methods does not result in the desired significance in the results, as by doing this repetition is distorting your testing probabilities. The nemenyi()
function first performs a Friedman test to establish that at least one method performs significantly different and then uses the post-hoc Nemenyi test to group methods by similar performance. Both tests are nonparametric, and therefore the distribution of the performance metric is not an issue.
x <- matrix( rnorm(50*4,mean=0,sd=1), 50, 4) # Generate some performance statistics
x[,2] <- x[,2]+1
x[,3] <- x[,3]+0.7
x[,4] <- x[,4]+0.5
colnames(x) <- c("Method A","Method B","Method C - long name","Method D")
nemenyi(x)
## Friedman and Nemenyi Tests
## The confidence level is 5%
## Number of observations is 50 and number of methods is 4
## Friedman test p-value: 2e-04 - Ha: Different
## Critical distance: 0.6633
The plot is fairly easy to read. Any for methods that are joined by a horizontal there is no evidence of statistically significant differences. There are multiple lines, as we can start measuring from different methods. The test is simple. For any method that is within plus/minus the critical distance (CD in the plot) from the mean rank (the values next to method names) there is no evidence of differences. The function provides different plot styles, using the argument plottype
. This can be used to minic the “multiple comparison with the best” (plottype="mcb"
or plottype="vmcb"
for vertical alternative), the “line plot” above (plottype="line"
or plottype="vline"
), a detailed comparison (plottype="matrix"
) or output no visualisation (plottype="none"
).
The mcb
visualisation shows the performance of the various methods seperately. The dot in the middle is the mean rank of the method and the lines span te critical distance (half above and half below). For any other method that its boundaries are crossed there is no evidence of significant differences. The matrix
shows on the vertical axis all methods sorted according to mean rank. On the horizontal axis they are sorted as in the input array. The balck cell is the method from which we measure and in a row (or column) any methods in blue belong to the same group of perofrmance. For example, in all the plots above, method A has no differences from d and c. Method d similarly is grouped together with a and c. When we measure from c, on the other hand, we have no evidence for significant differences from any method. When we measure from b, then both a and d have significant differences. If we are interested in the best performing group then the mcb
visualisation is the simplest to consult. Otherwise matrix
is the most detailed and line
/vline
is a compromise between the two.
Forecasting process modelling
Beyond various modelling functions, the package tsutils offers some functions for analysing the forecasting process. More specifically it helps to conduct ABC-XYZ analyses. This is done using the functions abc()
, xyz()
and abcxyz()
.
Starting with the ABC part of the analysis, given an array of time series, where each column is a time series, you can use:
x <- abs(matrix(cumsum(rnorm(5400,0,1)),36,150)) # Construct some random data
ABC <- abc(x)
print(ABC)
## ABC analysis
## Importance %
## A (20%): 39.678
## B (30%): 43.061
## C (50%): 17.261
We can also plot the output, providing the concentration in each class.
plot(ABC)
Alternatively, the input can be a vector that contains the importance index for each series. The reader is referred to Ord K., Fildes R., Kourentzes N. (2017) Principles of Business Forecasting, 2e. Wessex Press Publishing Co., p.515-518 for details of how to use the ABC analysis, as well as the XYZ analysis that follows.
The user can set custom percentages for each class, or change the number of classes with the argument prc
. For example:
plot(abc(x,prc=c(0.1,0.15,0.25,0.5)))
The function xyz()
looks at the forecastability side. In addition to the arguments in abc()
the user can select the type of forecast used to generate errors (type
) and the seasonal periodicity of the series (m
). For a discussion see the aforementioned reference and here. By default the naive and seasonal naive methods are used (type="naive"
), but the user can opt to rely on exponential smoothing (type="ets"
). The option to use the coefficient of variation is available (type="cv"
), but I strongly advise against using it (see the previous references).
XYZ <- xyz(x,m=12)
plot(XYZ)
One can combine both analyses with the function abcxyz()
. This can also accept a vector of forecast errors (with the argument error
) that are distributed to each class of items. This vector must be equal length to the columns used to construct the ABC and XYZ analyses.
abcxyz(ABC,XYZ)
## $class
## Z Y X
## A 0 0 30
## B 0 0 45
## C 30 45 0
##
## $error
## NULL
Quality of life functions
The package includes some QoL functions. The function geomean()
calculates the geometric mean and accepts the same arguments as the mean()
.
geomean(c(0.5,1,1.5))
## [1] 0.9085603
mean(c(0.5,1,1.5))
## [1] 1
The function leadtrail()
can remove leading or trailing zeros or NAs as requested. This is controlled by the argumentsrm
, lead
and trail
.
x <- c(rep(0,2),rnorm(5),rep(0,2))
print(x)
## [1] 0.0000000 0.0000000 -0.9519769 -1.1978381 -0.9596107 -1.6142629
## [7] 0.5855328 0.0000000 0.0000000
leadtrail(x)
## [1] -0.9519769 -1.1978381 -0.9596107 -1.6142629 0.5855328
The function wins()
winsorises vectors and accepts the argument p
to control how many observations are treated. If p
is less than 1, then it is interpretted as a percentage. If it is bigger or equal to 1, then it is treated as number of observations.
x <- sort(round(runif(10)*10))
print(x)
## [1] 0 1 1 1 3 7 8 8 9 9
wins(x,2)
## [1] 1 1 1 1 3 7 8 8 8 8
wins(x,0.1)
## [1] 1 1 1 1 3 7 8 8 9 9
There are two vectorised versions colWins()
and rowWins()
to speed up calculations for large arrays.
Finally, the function lambdaseq()
is meant to be used with LASSO and similar. Given some response target y
and regressors x
it calculates a sequence of lambdas from the OLS solution to the penalty that removes all regressors from the model. If weighted LASSO will be used, then the argument weigth
can beused to account for that. To switch to ridge regression or elastic nets use the argument alpha
. How many lambdas are produced (nLambda
) and how the lambda decreases (lambdaRatio
) can be controlled. The argument standardise
preprocess the variables, while the arguement addZeroLamda
does as it says and adds the OLS solution lambda in the output vector.
y <- mtcars[,1]
x <- as.matrix(mtcars[,2:11])
lambda <- lambdaseq(x, y)[[1]]
plot(lambda)
This can be directly used with the glmnet package, or with custom shrinkage estimators (the motivation behind building this function!).
library(glmnet)
fit.lasso <- cv.glmnet(x, y, lambda = lambda)
Users of TStools will not find any new functionality in tsutils yet. Under the hood there are various improvements and better documentation. Where necessary, the functions point to useful references and examples. Most importantly, these functions are now in CRAN and safe to use! The plan is to keep expanding tsutils with more functions that graduate from the development TStools.
Happy forecasting!
Nikos, thank you for this post. Its very inspiring.
This is wonderful!
Your posts are really great.
Had a query regarding data transformation during time series modelling, We use different transformations such as log or box cox transformations before applying time series models.I would like to know is it fine to do scaling(multipying or dividing by a constant) on the time series before applying time series models(Models like arima, ets , stlf etc).
I do not see why scaling would affect linear models. We can always do some tricks to split the estimation. For instance in lasso regression we take away the mean and do not estimate it as part of the regression, but do it separately. You could in principle do that with ETS, Arima, etc. However, keep in mind that if you are using ETS with multiplicative specifications, you need to avoid introducing zeros and make any proportional changes (i.e. multiply/divide), as otherwise you will break the multiplicative connection between the components and the level!