This vignette demonstrates basic features of RTMB by
implementing a random regression model from scratch and fit it
to a built-in R dataset.
We’ll take as starting point the built-in data set ‘ChickWeight’:
The data contains weight of 50 chicks by
Time and individual identifier Chick. A useful
plot of growth by individual is available from the help page by running
example(ChickWeight).
Although not the most natural based on the plot, we’ll setup a random regression model. It is defined by first drawing slopes and intercepts by individual from an underlying distribution
\[ a_1,...,a_{50} \sim N(\mu_a , \sigma_a^2) \]
\[ b_1,...,b_{50} \sim N(\mu_b , \sigma_b^2) \]
and then state a normal regression given these random coefficients:
\[ \text{weight}_{i} \sim N(a_{\text{Chick}_i} * \text{Time}_i + b_{\text{Chick}_i}, \sigma^2) \]
To implement the model in RTMB we have to set up the objective function and define the parameters and random effects.
Parameter objects are gathered in a list that also serves as initial guess when fitting the model:
parameters <- list(
    mua=0,          ## Mean slope
    sda=1,          ## Std of slopes
    mub=0,          ## Mean intercept
    sdb=1,          ## Std of intercepts
    sdeps=1,        ## Residual Std
    a=rep(0, 50),   ## Random slope by chick
    b=rep(0, 50)    ## Random intercept by chick
)The objective function takes as input such a parameter list, and is defined by
f <- function(parms) {
    getAll(ChickWeight, parms, warn=FALSE)
    ## Optional (enables extra RTMB features)
    weight <- OBS(weight)
    ## Initialize joint negative log likelihood
    nll <- 0
    ## Random slopes
    nll <- nll - sum(dnorm(a, mean=mua, sd=sda, log=TRUE))
    ## Random intercepts
    nll <- nll - sum(dnorm(b, mean=mub, sd=sdb, log=TRUE))
    ## Data
    predWeight <- a[Chick] * Time + b[Chick]
    nll <- nll - sum(dnorm(weight, predWeight, sd=sdeps, log=TRUE))
    ## Get predicted weight uncertainties
    ADREPORT(predWeight)
    ## Return
    nll
}This function calculates the negative log likelihood nll
using straight forward R operations corresponding exactly to the model
definition. In addition, some RTMB specific statements are used:
getAll function makes all the list elements of data
and parameters visible inside the function, so that one can write
e.g. weight rather than
ChickWeight$weight.weight <- OBS(weight) statement tells RTMB that
that weight is the response. This is needed to enable
automatic simulation and residual calculations from the model
object.ADREPORT(predWeight) statement tells RTMB that we
want uncertainties for this intermediate calculation.The objective function f is processed by RTMB using the
call
where we also specify that a and b are
random effects.
We optimize the model using nlminb (or any other
suitable gradient based optimizer in R)
Uncertainties are now calculated using
## sdreport(.) result
##        Estimate Std. Error
## mua    8.467355  0.5010482
## sda    3.463551  0.3630097
## mub   29.044188  1.8040325
## sdb   10.541116  1.5588779
## sdeps 12.890654  0.4237485
## Maximum gradient component: 1.117405e-05By default, the shown output is very brief, containing only the model
parameters. The sdr object contains much more. It is often
convenient to inspect parameter estimates and other output as a list
similar to the one containing the parameters. For instance, to get
parameters estimates and standard errors as separate lists use:
Pass report=TRUE to get ADREPORTed
quantities:
New datasets can be generated from the estimated model using
obj$simulate() assuming that the model is implemented in
accordance with the principles defined in the help page
?Simulation.
When building random effect models from scratch, many mistakes can be
made leading to wrong results. We always recommend to run the model
through the completely automatic consistency check (this requires that
obj$simulate() works for the implementation). By default,
the check simulates 100 datasets and calculates gradients for each
replicate. A standard output tells you whether the implementation is
consistent with simulation (message ‘simulation appears to be correct’).
It also gives an idea of the parameter bias usually caused by the
Laplace approximation (not an issue for the random regression model for
which the Laplace approximation is exact). We run the standard check
by:
## Parameters used for simulation:
##       mua       sda       mub       sdb     sdeps 
##  8.467355  3.463551 29.044188 10.541116 12.890654 
## 
## Test correct simulation (p.value):
## [1] 0.916118
## Simulation appears to be correct
## 
## Estimated parameter bias:
##          mua          sda          mub          sdb        sdeps 
##  0.041037447 -0.001069032 -0.092780844  0.057738402  0.027244614As expected for this case, everything looks fine. A complete simulation study, that re-estimates parameters for each replicate, can also be performed although that takes longer to run:
For more details we refer to the help page
?TMB::checkConsistency.
Quantile residuals can be generated automatically using the
oneStepPredict function. These residuals are conceptually
superior to other methods (e.g. Pearson residuals), but much trickier to
calculate. It is very important to specify an appropriate
method (see ?TMB::oneStepPredict) because
using an inappropriate method can give wrong residuals. For the random
regression model the ‘fullGaussian’ method is computationally exact.
The argument discrete=FALSE is necessary in this case
because the data has duplicates, which is a zero-probability event for
continuous distributions. If the model is correct, the residuals are
standard, independent normally distributed, which is obviously not the
case here.
For the random regression model, we could run MakeADFun
without problems. However, model implementation might not always work as
smoothly. In case of errors it is useful to test the implementation
step-wise.
Going back to our objective function f, first step
is to check that you can evaluate the function as a normal R
function:
An error at this point is obviously not due to
RTMB.
Next, it is useful to check that MakeADFun can be
run without random effects:
Should an error occur at this point, you can enable standard R
debugging debug(f) and run MakeADFun again to
figure out which code line caused the error. A common cause is to that
an unsupported (e.g. non-differentiable) operation has been
used.
Once obj has been constructed successfully, you
should evaluate it
and verify that it gives the same result as
f(parameters).
The random regression model could alternatively have been written
using the RTMB ‘tilde operator’ (%~%):
f2 <- function(parms) {
    getAll(ChickWeight, parms, warn=FALSE)
    ## Optional (enables extra RTMB features)
    weight <- OBS(weight)
    ## Random slopes
    a %~% dnorm(mean=mua, sd=sda)
    ## Random intercepts
    b %~% dnorm(mean=mub, sd=sdb)
    ## Data
    predWeight <- a[Chick] * Time + b[Chick]
    weight %~% dnorm(predWeight, sd=sdeps)
    ## Get predicted weight uncertainties
    ADREPORT(predWeight)
}This syntax is closer to other probabilistic languages (e.g. BUGS,
JAGS and Stan). But more importantly, it prevents some very common TMB
mistakes, by passing log=TRUE automatically and making sure
the sign of the objective function is correct.
Otherwise, f2 is identical to f, and the
model object can be constructed and fitted in the same way:
A common cause of confusion is that the RTMB version of
MakeADFun does not have a data argument. What
if we want to change the data? This can be handled in R using
closures. Start by adding an explicit data argument to the
objective function:
f3 <- function(parms, data) {
    getAll(data, parms, warn=FALSE)
    ## Optional (enables extra RTMB features)
    weight <- OBS(weight)
    ## Random slopes
    a %~% dnorm(mean=mua, sd=sda)
    ## Random intercepts
    b %~% dnorm(mean=mub, sd=sdb)
    ## Data
    predWeight <- a[Chick] * Time + b[Chick]
    weight %~% dnorm(predWeight, sd=sdeps)
    ## Get predicted weight uncertainties
    ADREPORT(predWeight)
}A general function to combine this objective with a specific data set is:
We can now easily create model objects using different datasets:
We’ve already seen how ADREPORT() can be used to make
predictions of derived quantities e.g. predWeight
corresponding to each of our measurements. But what if we need
predictions outside of where we have data? For example, some natural
prediction tasks could be:
10 for times 20:24 using diet
3.51.A general approach to make such predictions is to modify the
objective function to handle rows with missing (NA)
response, and then augment the original data with missing responses
where predictions are needed. We’ll take as starting point the
f3() implementation and modify (changes are marked with a
comment):
f_pred <- function(parms, data) {
    getAll(data, parms, warn=FALSE)
    ## Optional (enables extra RTMB features)
    weight <- OBS(weight)
    ## Random slopes
    a %~% dnorm(mean=mua, sd=sda)
    ## Random intercepts
    b %~% dnorm(mean=mub, sd=sdb)
    ## Data
    predWeight <- a[Chick] * Time + b[Chick]
    mis <- is.na(weight)                                 ## <-- Get missing responses
    weight[!mis] %~% dnorm(predWeight[!mis], sd=sdeps)   ## <-- Likelihood of non-missing
    ## Get predicted weight uncertainties
    ADREPORT(predWeight[mis])                            ## <-- Only report missing
}For the first prediction task we create a newdata object
with missing response:
We then refit the model using the original data augmented
with newdata (Note: refitting can be avoided - see
next section).
augdata <- rbind(ChickWeight, newdata)
obj <- MakeADFun(cmb(f_pred, augdata), parameters, random=c("a", "b"))
opt <- nlminb(obj$par, obj$fn, obj$gr)
sdr <- sdreport(obj)
cbind(newdata, summary(sdr, select="report"))##                   Time Diet Chick weight Estimate Std. Error
## predWeight.mis.     20    3    10     NA 121.8091   6.064611
## predWeight.mis..1   21    3    10     NA 126.1532   6.463963
## predWeight.mis..2   22    3    10     NA 130.4973   6.874058
## predWeight.mis..3   23    3    10     NA 134.8414   7.293083
## predWeight.mis..4   24    3    10     NA 139.1855   7.719585The second prediction task is slightly more tricky as it involves an
unseen chick number 51. That is, we need to add additional random
effects a and b corresponding to this
individual.
newdata <- transform(newdata, Chick="51")
augdata <- rbind(ChickWeight, newdata)
parameters$a <- rep(0, nlevels(augdata$Chick))
parameters$b <- rep(0, nlevels(augdata$Chick))This is all we need and we could in principle proceed as before by
refitting the model and running sdreport(). However, a
shortcut is worth mentioning: The previous fit of parameters
and hessian can be re-used as these can’t possibly change by
adding missing responses to the data:
We can then pass this information to sdreport and skip
the optimization step:
obj <- MakeADFun(cmb(f_pred, augdata), parameters, random=c("a", "b"))
sdr <- sdreport(obj, par.old, hessian.old)
cbind(newdata, summary(sdr, select="report"))##                   Time Diet Chick weight Estimate Std. Error
## predWeight.mis.     20    3    51     NA 198.3913   70.78365
## predWeight.mis..1   21    3    51     NA 206.8587   74.24496
## predWeight.mis..2   22    3    51     NA 215.3260   77.70971
## predWeight.mis..3   23    3    51     NA 223.7934   81.17744
## predWeight.mis..4   24    3    51     NA 232.2607   84.64780