In this document I compare the fit of the ‘INGARCH’ model and the GPAR(1) model for the campylobacterosis data.

The data set is the number of campylobacterosis infections per 28-day period in the north of the Province of Québec, Canada. The data start in January 1990 and end in October 2000. There are 13 observations per year and 140 observations in total.

The data set was introduced into the time series literature by Ferland, Latour and Oraichi (2006), who provide only the following remark as the source of the data: “The authors thank Pierre Boivin from the Infectious Diseases Services of the Public Health Department in Roberval (Québec) Canada, who provided the data set.” No link or formal reference is given.

I have failed to locate the data anywhere on the internet, and so far I have found no more detailed description of how the data are collected or recorded; in particular, whether they are incidence counts or of a stock type. But as infection with Campylobacter bacteria is a notifiable disease, I assume that the data are incidence counts.

Looking at the SACF, which is clearly influenced at least by the ‘outlier’, I would say the pronounced serial correlation structure, in particular the large first-order autocorrelation, makes a case for trying out a model from the GPAR class.

In this document I first replicate the results presented in the JSS paper by Liboschik, Fokianos and Fried (2017). They fitted an INGARCH(1) model with a stochastic seasonal component (the conditional mean at lag 13) and two intervention dummies: one to take out the huge spike (outlier) observed at data point no. 100, and one for the level shift from observation 84 onwards. Both modelling choices are motivated in the Fokianos and Fried (2010) paper.

Descriptive statistics of the data

Descriptive statistics for the time series \(campy\), including a time series plot and S(P)ACF plots:

## Mean of the data:  11.54286
## Variance of the data:  53.24275
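The variance is far larger than the mean; a quick back-of-the-envelope check of the dispersion index using the printed values (my own addition, not part of the original output):

```r
# Printed sample moments of the campy series
m <- 11.54286
v <- 53.24275

# Dispersion index: equals 1 for equidispersed (Poisson-like) counts
dispersion <- v / m
round(dispersion, 2)  # 4.61
```

A value far above 1 points to substantial overdispersion, which motivates the negative binomial and generalized Poisson specifications used below.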


Replication of the results published in Liboschik et al. (2017), JSS

We now fit exactly the same regression model to the data using the \(tscount\) package. Note that the regressors enter the conditional mean function in a linear fashion. This is feasible, as the conditional mean including the regressors is always non-negative.

## 
## Call:
## tsglm(ts = campy, model = list(past_obs = 1, past_mean = 13), 
##     xreg = interventions, distr = "nbinom")
## 
## Coefficients:
##              Estimate  Std.Error  CI(lower)  CI(upper)
## (Intercept)    3.3184     0.7851     1.7797      4.857
## beta_1         0.3690     0.0696     0.2326      0.505
## alpha_13       0.2198     0.0942     0.0352      0.404
## interv_1       3.0810     0.8560     1.4032      4.759
## interv_2      41.9541    12.0914    18.2554     65.653
## sigmasq        0.0297         NA         NA         NA
## Standard errors and confidence intervals (level =  95 %) obtained
## by normal approximation.
## 
## Link function: identity 
## Distribution family: nbinom (with overdispersion coefficient 'sigmasq') 
## Number of coefficients: 6 
## Log-likelihood: -381.0839 
## AIC: 774.1678 
## BIC: 791.8176 
## QIC: 787.6442

This is, of course, exactly the result published in the paper. Note, however, that there seems to be a typo in the Liboschik et al. JSS paper on page 18: in the display equation, they swap the time indices for the two intervention terms. The spike in the data occurs at time point 100, and the level shift runs from observation 84 to the end of the data set.
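For completeness, a base-R sketch of how the two intervention covariates can be constructed (the assignment of names to effects is inferred from the coefficient magnitudes above: the huge estimate of about 42 must belong to the spike dummy):

```r
n <- 140            # length of the campy series
t <- seq_len(n)

# Level shift from observation 84 onwards (estimated effect approx. 3.1)
interv_1 <- as.numeric(t >= 84)

# Spike (single-period outlier) at observation 100 (estimated effect approx. 42)
interv_2 <- as.numeric(t == 100)

# Combined into the xreg matrix passed to tsglm()
interventions <- cbind(interv_1, interv_2)
```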

The main diagnostics:

## Mean of the Pearson residuals:  0.007203263
## Variance of the Pearson residuals: 0.9711706

A one-step-ahead prediction. Note that the \(tscount\) package provides a non-coherent point prediction instead of a coherent density prediction.

## Time Series:
## Start = c(2000, 11) 
## End = c(2000, 11) 
## Frequency = 13 
## [1] 13.12045
## Time Series:
## Start = c(2000, 11) 
## End = c(2000, 11) 
## Frequency = 13 
##          lower upper
## 2000.769     6    22

I infer from this rather cryptic output that the non-coherent point prediction is 13.12. The output for the bounds is even more cryptic (the values seem to be rounded), but it shows a rather wide 95% prediction interval of [6, 22].
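To illustrate the distinction: a coherent point forecast for count data is an integer derived from the forecast distribution, for instance its mode, rather than the (generally non-integer) conditional mean. A purely illustrative base-R sketch using a Poisson predictive with the mean reported above; this is not the predictive distribution the package actually uses:

```r
lambda <- 13.12045   # non-coherent point prediction (conditional mean)

# Support large enough to cover essentially all predictive mass
support <- 0:60
pmf <- dpois(support, lambda)

# Coherent point forecast: the mode of the discrete forecast density
mode_fc <- support[which.max(pmf)]
mode_fc  # 13
```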

Estimation of the campy data using the coconots package

We now turn to the \(coconots\) package and fit a GPAR(1) model. The seasonal pattern is captured by the usual harmonics. Since there are 13 observations per year, the sine and cosine terms are adjusted accordingly. Note that I have to use a logarithmic link, as the harmonics take negative values.
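The harmonic regressors can be built in base R; a sketch with period 13 (the variable names cos_x and sin_x match the coefficient table below, but the construction is my own):

```r
n <- 140          # length of the campy series
t <- seq_len(n)
period <- 13      # 13 observations per year

# First-order harmonics with annual period 13
cos_x <- cos(2 * pi * t / period)
sin_x <- sin(2 * pi * t / period)

# Both are bounded in [-1, 1] and in particular take negative values,
# which is why the conditional mean needs a logarithmic link
range(cos_x)
```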

## Coefficients:
##         Estimate   Std. Error         t
## alpha     0.2839       0.0706    4.0179
## eta       0.1366       0.0537    2.5448
## const     1.6325       0.1354   12.0551
## intv1     0.5874       0.0801    7.3365
## intv2     1.2593       0.1971    6.3884
## cos_x    -0.0907       0.0592   -1.5327
## sin_x    -0.2602       0.0575   -4.5210
## 
## Type: GP 
## Order: 1 
## 
## Log-likelihood: -372.8559

All parameters except the cosine term are individually significantly different from zero, and the harmonics are jointly significant. The \(intv2\) coefficient, which was numerically large in the INGARCH model above, is now much smaller; but that is not surprising, given the log link.
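Under the log link the coefficients act multiplicatively on the conditional mean, so the spike coefficient is best read on the exponential scale. A quick check using the printed estimate:

```r
# intv2 estimate from the coconots fit above
intv2 <- 1.2593

# Multiplicative effect on the conditional mean under the log link
exp(intv2)  # roughly 3.5
```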

We turn to the diagnostics:

## Mean of the Pearson residuals: -0.002
## Variance of the Pearson residuals:  1.054

My interpretation: the GPAR(1) model captures the dynamics well, at least comparably to the INGARCH model. The PIT histogram does not signal misspecification.

We finally turn to a one-step-ahead prediction:

## Mode of the forecast density: 14
## Median of the forecast density:  15

Finally, we compare the values of the scoring rules from both packages:

## Scoring rules from the tscount package:
## logarithmic   quadratic   spherical    rankprob      dawseb      normsq 
##  2.72202783 -0.07799704 -0.27659087  2.18510509  3.60593480  0.96428551 
##     sqerror 
## 16.51281656
## Scoring rules from the coconots package:
## $log.score
## [1] 2.682416
## 
## $quad.score
## [1] -0.08487704
## 
## $rps.score
## [1] 2.089507
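For reference, the logarithmic score of a predictive distribution with pmf \(p\) at the realized value \(y\) is \(-\log p(y)\), with smaller values being better; on this score the GPAR(1) model (2.68) edges out the INGARCH model (2.72). A minimal base-R illustration with a Poisson predictive, purely for exposition (the packages compute the scores from their fitted predictive distributions):

```r
# Logarithmic score: negative log predictive mass at the observed count
log_score <- function(y, pmf) -log(pmf(y))

# Example: Poisson(13) predictive distribution, observed count 11
log_score(11, function(y) dpois(y, lambda = 13))
```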