Licensed under the Creative Commons attribution-noncommercial license, http://creativecommons.org/licenses/by-nc/3.0/. Please share and remix noncommercially, mentioning its origin.
Preparation
Make sure that R and RStudio are installed on your computer. Then open an RStudio session and run the following in the console:
baseurl <- "http://kinglab.eeb.lsa.umich.edu/ICTPWID/SaoPaulo_2015/"
script <- "scripts/packages.R"
source(paste0(baseurl,script))
This script will try to install several needed packages. You may be prompted for a “CRAN mirror”: choose one here in Sao Paulo.
This course will focus on the use of models for understanding, predicting, and controlling infectious disease systems. Models play a central role in this work because they allow us to precisely and quantitatively express our ideas about the mechanisms of infectious disease transmission, immunity, and ecology. To the extent our understanding of the important mechanisms is correct, such models are extremely useful in the design of policy. On the other hand, models are useful in uncovering the important mechanisms, inasmuch as we can compare model predictions to data. Specifically, if we can translate competing hypotheses into mathematical models, these can be compared in terms of their ability to correctly capture the patterns seen in data.
In order to compare competing models fairly, we must first find out the best that each model can do. Practically speaking, this means that we have to find the values of the models’ parameters that give the closest correspondence between model predictions and data. Parameter estimation can be important even when we are fairly confident in the ability of a single model to explain the dynamics. Not surprisingly, the all-important quantity R0 is very often a primary target of such estimation efforts.
For a simple, closed SIR outbreak, we can derive an expression that determines the final size of the outbreak, i.e., the total number of hosts infected. To do this, note that if f is the fraction of the population ultimately infected, the final size satisfies 1 - f = exp(-R0 f), which rearranges to give R0 = -log(1-f)/f. The following code plots the relationship between final size and R0:
f <- seq(0,1,length=100)
R0 <- -log(1-f)/f
plot(f~R0,type='l',xlab=expression(R[0]),ylab="fraction infected",bty='l')
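Going the other way, i.e., finding the final size f implied by a given R0, requires solving the final-size equation numerically. Here is a minimal sketch using uniroot; the value R0 = 3 is an arbitrary example and the code assumes R0 > 1:
final.size <- function (R0) {
## nontrivial root of 1-f = exp(-R0*f), assuming R0 > 1
uniroot(function (f) 1-f-exp(-R0*f),interval=c(1e-6,1-1e-12))$root
}
final.size(3) # fraction infected when R0 = 3 (about 0.94)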
During the early stages of an SIR outbreak, the number of infected individuals grows approximately exponentially, so a plot of the logarithm of case counts against time should be approximately linear, with a slope related to R0 and the infectious period.
We can plot the 1977 boarding school influenza data (Anonymous 1978) in this way to see if this is the case.
baseurl <- "http://kinglab.eeb.lsa.umich.edu/ICTPWID/SaoPaulo_2015/"
url <- paste0(baseurl,"data/boarding_school_influenza.csv")
bbs <- read.csv(url,comment.char="#",colClasses=c(date="Date"))
bbs <- subset(bbs,status=="confined",select=-status)
bbs$day <- as.numeric(bbs$date-min(bbs$date))
## or download the file and do:
## bbs <- read.csv("FILE.csv",comment.char="#",colClasses=c(date="Date"))
plot(cases~day,data=bbs,type='b',bty='l',
main='boarding school influenza outbreak',
xlab='day',ylab='Influenza cases')
plot(cases~day,data=bbs,type='b',log='y',bty='l',
xlab='day',ylab='Influenza cases')
Plotted on a log scale, the linearity of the first several data points is indeed striking. This suggests that we can obtain a cheap and cheerful estimate of R0 by fitting a straight line to the logarithm of the initial case counts:
fit <- lm(log(cases)~day,data=subset(bbs,day<=4))
summary(fit)
##
## Call:
## lm(formula = log(cases) ~ day, data = subset(bbs, day <= 4))
##
## Residuals:
## 1 2 3 4 5
## -0.28779 0.17357 0.30950 0.01146 -0.20673
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.28779 0.22432 1.283 0.289656
## day 1.33041 0.09158 14.527 0.000707 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2896 on 3 degrees of freedom
## Multiple R-squared: 0.986, Adjusted R-squared: 0.9813
## F-statistic: 211 on 1 and 3 DF, p-value: 0.0007072
coef(fit)
## (Intercept) day
## 0.2877876 1.3304055
slope <- coef(fit)[2]
slope
## day
## 1.330405
Now, we know that influenza’s infectious period is about 2.5 da, i.e., γ ≈ 1/2.5 per day. Moreover, since the infectious period is far shorter than an average human life, we can neglect host births and deaths, in which case the early-epidemic slope is approximately γ (R0 - 1), so that R0 ≈ 1 + 2.5 × slope ≈ 4.3.
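We can compute this rough estimate directly from the fitted slope (a one-line sketch using the slope extracted above):
R0.hat <- 1 + 2.5*slope # R0 = 1 + slope/gamma, with gamma = 1/2.5 per day
R0.hat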
Our strategy in this case has been to redefine the problem so that it fits a standard form, i.e., we showed how to rearrange the model so that the relevant quantity (R0) is a simple function of a regression coefficient, which standard software estimates for us together with its standard error. That standard error translates directly into a standard error for R0:
coef(summary(fit))
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.2877876 0.22432360 1.282913 0.289655902
## day 1.3304055 0.09157973 14.527292 0.000707225
slope.se <- coef(summary(fit))[2,2]
2.5*slope.se
## [1] 0.2289493
So we reckon we’ve got a standard error of about ±0.23 on our estimate of R0.
A defect of this method is that it uses only a small amount of the data to compute an important quantity. Moreover, we have to make a subjective judgement as to how much of the data to use. Further, as we use more data, and presumably obtain more precise estimates, we simultaneously get further from the realm where our approximation is valid, which introduces greater bias. Let’s see how our estimates of R0 depend on how much of the data we use:
days <- 2:10
slope <- numeric(length=length(days))
slope.se <- numeric(length=length(days))
for (k in seq_along(days)) {
fit <- lm(log(cases)~day,data=subset(bbs,day<=days[k]))
slope[k] <- coef(summary(fit))[2,1]
slope.se[k] <- coef(summary(fit))[2,2]
}
R0.hat <- slope*2.5+1
R0.se <- slope.se*2.5
plot(slope~days,type='o')
We’ll plot these estimates, with bands showing ±2 standard errors, to show this precision-accuracy tradeoff.
plot(range(days),
range(c(R0.hat-2*R0.se,R0.hat+2*R0.se),na.rm=T),
type='n',bty='l',
xlab="length of initial phase (da)",
ylab=expression("estimated"~R[0]))
lines(R0.hat~days,type='o',lwd=2)
lines(R0.hat+2*R0.se~days,type='l')
lines(R0.hat-2*R0.se~days,type='l')
Biweekly data for outbreaks of measles in three communities within Niamey, Niger (Grais et al. 2006) are provided in the file http://kinglab.eeb.lsa.umich.edu/ICTPWID/SaoPaulo_2015/data/niamey.csv. Use this method to obtain estimates of R0 for each of the three communities.
url <- paste0(baseurl,"data/niamey.csv")
niamey <- read.csv(url,comment.char="#")
plot(measles~biweek,data=niamey,subset=community=="A",type='b')
plot(measles~biweek,data=niamey,subset=community=="B",type='b')
plot(measles~biweek,data=niamey,subset=community=="C",type='b')
Here we begin our study of computational techniques for studying epidemiological models in R. In this session we introduce the numerical solution (or integration) of nonlinear differential equations using the sophisticated solvers found in the package deSolve. Numerical integration is one of the most important tools we have for the analysis of epidemiological models.
As we saw in the lecture, the classical SIR compartmental model
divides a population of hosts into three classes: susceptible, infected,
recovered. The model describes how the fraction of a population in each
of these classes changes with time. Alternatively, the model can track
the number of individuals in each class. Births are modeled as flows
from “nowhere” into the susceptible class; deaths are modeled as flows
from the S, I, or R compartment into “nowhere”. If S, I, and R denote the fractions of the population in each of these classes, the model takes the form of a system of coupled ordinary differential equations describing the flows among the compartments.
As with almost all epidemiological models, these equations cannot be solved analytically. However, we can compute the trajectories
of a continuous-time model such as this one by integrating the
equations numerically. Doing this accurately involves a lot of
calculation, and there are smart ways and not-so-smart ways of going
about it. This very common problem has been very thoroughly studied by
numerical analysts for generations so that, when the equations are
smooth, well-behaved functions, excellent numerical integration
algorithms are readily available to compute approximate solutions to
high precision. In particular, R
has several sophisticated
ODE solvers which (for many problems) will give highly accurate
solutions. These algorithms are flexible, automatically perform checks,
and give informative errors and warnings. To use the numerical
differential equation solver package, we load the deSolve
package
require(deSolve)
The ODE solver we’ll use is called ode
. Let’s have a look at the help page for this function.
?ode
ode needs to know the initial values of the state variables (y), the times at which we want solutions, and a function (func) computing the right-hand side of the ODE. The latter can optionally depend on some parameters (parms).
Let’s study the SIR model for a closed population, i.e., one in which we can neglect births and deaths. Recall that the differential equations for the closed epidemic are

dS/dt = -β S I
dI/dt = β S I - γ I
dR/dt = γ I

To encode these equations in a form suitable for use as the func argument to ode, we’ll need to write a function. For example:
closed.sir.model <- function (t, x, params) {
## first extract the state variables
S <- x[1]
I <- x[2]
R <- x[3]
## now extract the parameters
beta <- params["beta"]
gamma <- params["gamma"]
## now code the model equations
dSdt <- -beta*S*I
dIdt <- beta*S*I-gamma*I
dRdt <- gamma*I
## combine results into a single vector
dxdt <- c(dSdt,dIdt,dRdt)
## return result as a list!
list(dxdt)
}
Note that the order and type of the arguments and output of this function must exactly match ode
’s expectations. Thus, for instance, the time variable t
must be the first argument even if, as is the case here, nothing in the
function depends on time. [When the RHS of the ODE is independent of time, we say the ODE is autonomous.] Note also that ode expects the values of the ODE RHS to be returned as the first element of a list.
Now we can call ode
to compute trajectories of the
model. To do this, we’ll need some values of the parameters. If we’re
thinking of a disease something like measles, and measuring time in
years, we might use something like:
params <- c(beta=400,gamma=365/13)
What is the infectious period of this disease? What is R0?
We now state the times at which we want solutions and specify the initial conditions, i.e., the starting values of the state variables
times <- seq(from=0,to=60/365,by=1/365/4) # returns a sequence
xstart <- c(S=0.999,I=0.001,R=0.000) # initial conditions
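Before handing these to ode, it can be worth checking that the right-hand-side function evaluates sensibly at the initial conditions (an optional sanity check):
closed.sir.model(t=0,x=xstart,params=params) # should return a list of the three derivatives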
Next, we compute a model trajectory with the ode
command and store the result in a data-frame:
require(deSolve)
out <- as.data.frame(
ode(
func=closed.sir.model,
y=xstart,
times=times,
parms=params
)
)
and plot the results using the commands:
plot(I~time,data=out,type='l')
Suppose that you’d rather measure time in days. Modify the parameters accordingly and verify your modifications.
Let’s study how the epidemic curve depends on the transmission rate, β, and the infectious period, 1/γ:
betavals <- c(20,50,500)
ips <- c(5,10,30)
gammavals <- 365/ips
## set some plot parameters
op <- par(mgp=c(2,1,0),mar=c(3,3,1,1),mfrow=c(3,3))
for (beta in betavals) {
for (gamma in gammavals) {
params <- c(beta=beta,gamma=gamma)
out <- as.data.frame(
ode(
func=closed.sir.model,
y=xstart,
times=times,
parms=params
)
)
title <- bquote(list(beta==.(beta),"IP"==.(365/gamma)~"da"))
plot(I~time,data=out,type='l',main=title)
}
}
par(op) # restore old settings
Simulation is a useful tool, but its power is limited. The next exercise demonstrates the importance of being able to analyze the equations as well.
For each of the above parameter combinations, describe the system’s behavior. Compute R0 for each parameter combination.
Use the ODE solver to study the dependence of the epidemic’s final size on R0.
Over a sufficiently short time scale, the assumption that the
population is closed is reasonable. To capture the dynamics over the
longer term, we’ll need to account for births and deaths, i.e., allow
the population to be an open one. As we’ve seen, if we further assume that the per capita birth rate equals the per capita death rate, μ, then the SIR equations become

dS/dt = μ (1 - S) - β S I
dI/dt = β S I - (μ + γ) I
dR/dt = γ I - μ R
We must modify the ODE function accordingly:
open.sir.model <- function (t, x, params) {
beta <- params["beta"]
mu <- params["mu"]
gamma <- params["gamma"]
dSdt <- mu*(1-x[1])-beta*x[1]*x[2]
dIdt <- beta*x[1]*x[2]-(mu+gamma)*x[2]
dRdt <- gamma*x[2]-mu*x[3]
list(c(dSdt,dIdt,dRdt))
}
We’ll need to specify a birth/death rate in addition to the two parameters we specified before:
params <- c(mu=1/50,beta=400,gamma=365/13)
We integrate the equations as before:
times <- seq(from=0,to=25,by=1/365)
out <- as.data.frame(
ode(
func=open.sir.model,
y=xstart,
times=times,
parms=params
)
)
We can plot each of the state variables against time, and I against S (the phase portrait), on logarithmic scales:
op <- par(fig=c(0,1,0,1),mfrow=c(2,2),
mar=c(3,3,1,1),mgp=c(2,1,0))
plot(S~time,data=out,type='l',log='y')
plot(I~time,data=out,type='l',log='y')
plot(R~time,data=out,type='l',log='y')
plot(I~S,data=out,log='xy',pch='.',cex=0.5)
par(op)
Explore the dynamics of the system for different values of the model parameters. Under the assumptions of this model, the average host lifespan is 1/μ; the value μ = 1/50 used above thus corresponds to a lifespan of 50 yr.
Explore how host lifespan affects the dynamics by integrating the differential equations for lifespans of 20 and 200 years.
The compartmental modeling strategy can be put to use in modeling a tremendous range of infections. The following exercises make some first steps in this direction.
The SIR model assumes lifelong sterilizing immunity following
infection. For many infections, immunity is not permanent. Make a
compartment diagram for an SIRS model, in which individuals lose their
immunity after some time. Write the corresponding differential equations
and modify the above codes to study its dynamics. Compare the SIR and
SIRS dynamics for the parameter values used above.
Make a diagram, write the equations, and study the dynamics of the
SEIR model for the dynamics of an infection with a latent period.
Compare the dynamics of the SIR and SEIR models for the parameter values used above.
The simple SIR model always predicts damped oscillations towards an equilibrium (or pathogen extinction if R0 < 1). This is at odds with the recurrent outbreaks many pathogens display. One simple way to obtain sustained oscillations is to allow the transmission rate to vary seasonally, e.g., β(t) = β0 (1 + β1 cos(2 π t)), where t is measured in years:
seas.sir.model <- function (t, x, params) {
beta0 <- params["beta0"]
beta1 <- params["beta1"]
mu <- params["mu"]
gamma <- params["gamma"]
beta <- beta0*(1+beta1*cos(2*pi*t))
dSdt <- mu*(1-x[1])-beta*x[1]*x[2]
dIdt <- beta*x[1]*x[2]-(mu+gamma)*x[2]
dRdt <- gamma*x[2]-mu*x[3]
list(c(dSdt,dIdt,dRdt))
}
params <- c(mu=1/50,beta0=400,beta1=0.15,gamma=365/13)
xstart <- c(S=0.07,I=0.00039,R=0.92961)
times <- seq(from=0,to=30,by=7/365)
out <- as.data.frame(
ode(
func=seas.sir.model,
y=xstart,
times=times,
parms=params
)
)
op <- par(fig=c(0,1,0,1),mfrow=c(2,2),
mar=c(3,3,1,1),mgp=c(2,1,0))
plot(S~time,data=out,type='l',log='y')
plot(I~time,data=out,type='l',log='y')
plot(R~time,data=out,type='l',log='y')
plot(I~S,data=out,log='xy',pch='.',cex=0.5)
par(op)
Explore the dynamics of the seasonally forced SIR model for increasing amplitude β1.
When a covariate forces the equations, we must interpolate the covariate. To give an example, let’s suppose that the transmission rate depends on rainfall, for which we have monthly data:
rain <- read.csv(paste0(baseurl,"data/dacca_rainfall.csv"))
rain$time <- with(rain,year+(month-1)/12)
plot(rainfall~time,data=rain,type='l')
Let’s assume that transmission depends on rainfall according to β = a R/(b + R), where R denotes the rainfall. To use this inside the ODE solver, we need rainfall as a function of continuous time, which means interpolating between the monthly data points. R affords us numerous ways of doing this. The R functions approxfun and splinefun construct interpolating functions in different ways; see the documentation on these functions.
interpol <- with(rain,approxfun(time,rainfall,rule=2,method='constant'))
data.frame(time=seq(from=1920,to=1930,by=1/365)) -> smoothed.rain
smoothed.rain$rainfall <- sapply(smoothed.rain$time,interpol)
plot(rainfall~time,data=rain,col='black',cex=2,pch=16,log='')
lines(rainfall~time,data=smoothed.rain,col='red')
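For a smoother interpolation, one might instead use splinefun, mentioned above; a brief sketch (the monotone method "monoH.FC" is just one of several possible choices):
interpol.spline <- with(rain,splinefun(time,rainfall,method="monoH.FC"))
smoothed.rain$rain.spline <- interpol.spline(smoothed.rain$time)
lines(rain.spline~time,data=smoothed.rain,col='blue')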
rain.sir.model <- function (t, x, params) {
a <- params["a"]
b <- params["b"]
mu <- params["mu"]
gamma <- params["gamma"]
R <- interpol(t)
beta <- a*R/(b+R)
dSdt <- mu*(1-x[1])-beta*x[1]*x[2]
dIdt <- beta*x[1]*x[2]-(mu+gamma)*x[2]
dRdt <- gamma*x[2]-mu*x[3]
list(c(dSdt,dIdt,dRdt))
}
params <- c(a=500,b=50,mu=1/50,gamma=365/13)
xstart <- c(S=0.07,I=0.00039,R=0.92961)
times <- seq(from=1920,to=1930,by=7/365)
out <- as.data.frame(
ode(
func=rain.sir.model,
y=xstart,
times=times,
parms=params
)
)
op <- par(fig=c(0,1,0,1),mfrow=c(2,2),
mar=c(3,3,1,1),mgp=c(2,1,0))
plot(S~time,data=out,type='l',log='y')
plot(I~time,data=out,type='l',log='y')
plot(R~time,data=out,type='l',log='y')
plot(I~S,data=out,log='xy',pch='.',cex=1)
par(op)
More generally, any fitting method that has an associated predict
method can be used. For example, we can use local polynomial regression, loess
, to smooth the rainfall data. Do ?loess
and ?predict.loess
for more details.
fit <- loess(log1p(rainfall)~time,data=rain,span=0.05)
data.frame(time=seq(from=1920,to=1930,by=1/365)) -> smoothed.rain
smoothed.rain$rainfall <- expm1(predict(fit,newdata=smoothed.rain))
plot(rainfall~time,data=rain,col='black',cex=2,pch=16,log='')
lines(rainfall~time,data=smoothed.rain,col='red')
Now we move on to a much more general but considerably more complicated technique for estimating R0 and other parameters: fitting the model’s trajectories directly to the data.
We’ll illustrate this method using the Niamey data, which we’ll load and plot using the following commands:
url <- paste0(baseurl,"data/niamey.csv")
niamey <- read.csv(url,comment.char="#")
plot(measles~biweek,data=niamey,type='n')
lines(measles~biweek,data=subset(niamey,community=="A"),col=1)
lines(measles~biweek,data=subset(niamey,community=="B"),col=2)
lines(measles~biweek,data=subset(niamey,community=="C"),col=3)
legend("topleft",col=1:3,lty=1,bty='n',
legend=paste("community",c("A","B","C")))
Since this is a single outbreak, and the number of births and deaths
into the population over the course of the outbreak is small relative to
the size of the population, we can treat this outbreak as if it were
occurring in a closed population. We saw earlier how we can formulate
the SIR equations in such a case and how we can solve the model
equations to obtain trajectories. Before, we formulated the model in
terms of the fractions, S, I, and R, of the host population in each compartment; here it will be more convenient to track the numbers of hosts, X, Y, and Z, in each compartment. The code using deSolve for the closed SIR epidemic is:
require(deSolve)
closed.sir.model <- function (t, x, params) {
X <- x[1]
Y <- x[2]
Z <- x[3]
beta <- params["beta"]
gamma <- params["gamma"]
pop <- params["popsize"]
dXdt <- -beta*X*Y/pop
dYdt <- beta*X*Y/pop-gamma*Y
dZdt <- gamma*Y
list(c(dXdt,dYdt,dZdt))
}
Thus far, we have only considered deterministic models. In the next lab, we will begin to think about more realistic models that begin to take into account some aspects of the stochastic nature of real epidemics. For now, under the assumption that the epidemic is deterministic, parameter estimation is a matter of finding the model trajectory that gives the best fit to the data. The first thing we need is a function that computes a trajectory given parameters of the model.
prediction <- function (params, times) {
xstart <- params[c("X.0","Y.0","Z.0")]
out <- ode(
func=closed.sir.model,
y=xstart,
times=times,
parms=params
)
out[,3] # return the number of infectives
}
Now we set up a function that will calculate the sum of the squared differences (or errors) between the data and the model predictions.
sse <- function (params, data) {
times <- c(0,data$biweek/26) # convert to years
pred <- prediction(params,times)
discrep <- pred[-1]-data$measles
sum(discrep^2) # sum of squared errors
}
To get a sense of what this gives us, let’s explore varying some
parameters and computing the SSE for community “A” of the Niamey data
set. To begin with, we’ll assume that we know the infectious period (13 da, i.e., γ = 365/13 yr⁻¹), the population size, and the initial conditions, so that only β remains to be estimated:
dat <- subset(niamey,community=="A")
params <- c(X.0=10000,Y.0=10,Z.0=39990,popsize=50000,
gamma=365/13,beta=NA)
f <- function (beta) {
params["beta"] <- beta
sse(params,dat)
}
beta <- seq(from=0,to=1000,by=5)
SSE <- sapply(beta,f)
We take our estimate of β to be the value that minimizes the SSE:
beta.hat <- beta[which.min(SSE)]
We can plot SSE vs. β, marking the minimizing value:
plot(beta,SSE,type='l')
abline(v=beta.hat,lty=2)
plot.window(c(0,1),c(0,1))
text(0.05,0.9,"A")
What does the SIR model predict at this value of β? Let’s overlay the model prediction on the data:
params["beta"] <- beta.hat
plot(measles~biweek,data=dat)
lines(dat$biweek,prediction(params,c(0,dat$biweek/26))[-1])
Use this method to obtain estimates of β for the outbreaks in communities “B” and “C”.
Clearly, this fit leaves much to be desired, but recall that we’ve here assumed that we know the correct values of all parameters except β. In particular, we assumed a value for the initial number of susceptibles, X(0). Let’s relax that assumption and evaluate the SSE over a grid of β and X(0) values:
dat <- subset(niamey,community=="A")
params <- c(X.0=NA,Y.0=10,Z.0=1,popsize=50000,
gamma=365/13,beta=NA)
f <- function (beta, X.0) {
params["beta"] <- beta
params["X.0"] <- X.0
sse(params,dat)
}
grid <- expand.grid(beta=seq(from=100,to=300,length=50),
X.0=seq(from=4000,to=20000,length=50))
grid$SSE <- with(grid,mapply(f,beta,X.0))
We can visualize this as a surface. A convenient function for this is contourplot
from the lattice
package:
require(lattice)
contourplot(sqrt(SSE)~beta+X.0,data=grid,cuts=30)
Discuss the shape of this surface: what does it tell us about the uncertainty in the model’s parameters?
Repeat the estimation using a closed SEIR model. Assume that the infectious period is 5 da and the latent period is 8 da. How and why does your estimate of R0 change?
When we have more than two parameters to estimate (as we usually
will), we cannot rely on grid searches or simple graphical techniques to
find the region of parameter space with the best parameters. We need
more systematic ways of searching through the parameter space.
Mathematicians have devoted much study to optimization algorithms, and there are many of these. Many of them are implemented in R
.
The first place to go is the function optim
, which implements several common, well-studied, generally-useful optimization algorithms.
?optim
To use it, we have to specify the function we want to minimize and a starting value for the parameters. Starting from this point, optim’s algorithms will search the parameter space for the parameter values that minimize our objective function.
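As a toy illustration of the calling convention (nothing epidemiological here, just a two-dimensional quadratic whose minimum is at (1,2)):
toy <- function (x) sum((x-c(1,2))^2) # a simple objective function
optim(fn=toy,par=c(0,0))$par # should converge to approximately (1,2)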
We’ll write an objective function to try to estimate X(0), Y(0), and β simultaneously:
dat <- subset(niamey,community=="A")
params <- c(X.0=NA,Y.0=NA,Z.0=1,popsize=50000,
gamma=365/13,beta=NA)
f <- function (par) {
params[c("X.0","Y.0","beta")] <- par
sse(params,dat)
}
optim(fn=f,par=c(10000,10,220)) -> fit
fit
## $par
## [1] 9150.769987 1.017659 271.651714
##
## $value
## [1] 166584.5
##
## $counts
## function gradient
## 273 NA
##
## $convergence
## [1] 0
##
## $message
## NULL
In the foregoing, we’ve estimated parameters by minimizing the sum of
squared differences between model-predicted number of cases and the
data. What would happen if we tried to minimize the squared error on the
log scale, i.e., to minimize the sum of squared differences between the logarithms of the model predictions and the logarithms of the data?
Try to estimate all four parameters at once. Start your algorithm from several places to check that they all converge to the same place. You may find it useful to restart the optimizer to verify its convergence.
The above plot of SSE against β shows just a one-dimensional slice through the objective-function surface; when several parameters are estimated at once, it becomes harder to visualize that surface and to verify that the optimizer has found its minimum.
Many other optimization algorithms exist. optim
implements several of these (see ?optim
). Other functions and packages you might look into include: constrOptim
, optimx
, nlm
, nlminb
, nloptr
(from the nloptr
package), and subplex
(from the subplex
package).
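For example, one might re-run the fit above with a different algorithm and compare the achieved values of the objective function (a quick sketch; there is no guarantee that any particular method is best for this problem):
fit.nm <- optim(fn=f,par=c(10000,10,220),method="Nelder-Mead")
fit.bfgs <- optim(fn=f,par=c(10000,10,220),method="BFGS")
c(NelderMead=fit.nm$value,BFGS=fit.bfgs$value) # compare the SSE each achieves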
We have seen that fitting mechanistic models to data is a powerful and general approach to estimating parameters. We saw too that least-squares fitting is a straightforward way to do this. However, several issues arose. First, there was an element of arbitrariness in the choice of discrepancy measure. Second, although we could fairly easily obtain point estimates of model parameters using least-squares, it was not clear how we could obtain concomitant estimates of parameter uncertainty (e.g., confidence intervals). Finally, we began to see that there are limits to our ability to estimate parameters. In this lab, we’ll explore these issues, and see that likelihood offers an attractive resolution to the first and second of these, but that the third is a fundamental challenge.
Likelihood has many advantages as a measure of agreement between model and data, and some disadvantages, as we will see in what follows.
Likelihood is the probability of a given set of data having arisen under a given hypothesis; we treat it as a function of the unknown parameters, with the data held fixed.
A simple example: suppose we test n = 50 individuals for serological evidence of exposure to some pathogen and record the number k that test positive. If the true seroprevalence were, say, p = 0.3, then the probability of observing each possible value of k is given by the binomial distribution:
p <- 0.3
n <- 50
k <- seq(0,50,by=1)
prob <- dbinom(x=k,size=n,prob=p)
plot(k,prob,type='h',lwd=5,lend=1,
ylab="probability")
The likelihood is a function of the unknown parameters. In this case, if we assume the number of seropositives is binomially distributed, then the log likelihood of the seroprevalence p, given that we observe k1 = 18 positives in a sample of n1 = 50, is log dbinom(x = k1, size = n1, prob = p). We can plot it as a function of p:
k1 <- 18
n1 <- 50
p <- seq(0,1,by=0.001)
plot(p,dbinom(x=k1,size=n1,prob=p,log=TRUE),
ylim=c(-10,-2),ylab="log-likelihood",
type='l')
abline(h=dbinom(x=k1,size=n1,prob=k1/n1,log=TRUE)-
0.5*qchisq(p=0.95,df=1),col='red')
abline(v=k1/n1,col='blue')
k2 <- 243
n2 <- 782
p <- seq(0,1,by=0.001)
plot(p,dbinom(x=k2,size=n2,prob=p,log=TRUE),
ylim=c(-10,-2),ylab="log-likelihood",
type='l')
abline(h=dbinom(x=k2,size=n2,prob=k2/n2,log=TRUE)-
0.5*qchisq(p=0.95,df=1),col='red')
abline(v=k2/n2,col='blue')
In the above two plots, the log likelihood is a function of the model parameter p, the data being held fixed. The blue vertical line marks the maximum likelihood estimate (MLE), p = k/n; the red horizontal line marks a cutoff 1.92 log-likelihood units below the maximum, which delimits an approximate 95% confidence interval.
How do the two curves just plotted differ from one another? What features of the data are responsible for the differences?
Let’s suppose we have three samples, with n = 13, 484, and 3200 individuals tested and k = 4, 217, and 1118 seropositive, respectively, and that we wish to estimate a common seroprevalence p. If the samples are independent, the likelihood of the combined data is the product of the individual likelihoods, so the log likelihood is the sum of the individual log likelihoods:
n <- c(13,484,3200)
k <- c(4,217,1118)
dbinom(x=k,size=n,prob=0.2,log=TRUE)
## [1] -1.873761 -79.243371 -197.561806
sum(dbinom(x=k,size=n,prob=0.2,log=TRUE))
## [1] -278.6789
ll.fn <- function (p) {
sum(dbinom(x=k,size=n,prob=p,log=TRUE))
}
p <- seq(0,1,by=0.001)
loglik <- sapply(p,ll.fn)
plot(p,loglik,type='l',ylim=max(loglik)+c(-10,0))
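The maximum of this combined log likelihood can also be located numerically; for independent binomial samples sharing a common p, it coincides with the pooled proportion (a quick check using optimize):
optimize(ll.fn,interval=c(0,1),maximum=TRUE)$maximum # numerical maximizer
sum(k)/sum(n) # pooled proportion: the analytical MLE for a common binomial p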
Let’s revisit the model-fitting we did yesterday for the case of measles in Niger. We’ll simplify the model slightly to eliminate some unnecessary and wasteful elements. Our frequency-dependent SIR model, again, is

dX/dt = -b X Y
dY/dt = b X Y - γ Y,

where X and Y are the numbers of susceptible and infectious hosts, respectively. Since the removed hosts play no role in the dynamics, we need not track them. We’ll modify the R codes we used before to take account of this.
require(deSolve)
closed.sir.model <- function (t, x, params) {
inc <- params["b"]*x[1]*x[2] # incidence
list(c(-inc,inc-params["gamma"]*x[2]))
}
prediction <- function (params, times) {
out <- ode(
func=closed.sir.model,
y=params[c("X.0","Y.0")],
times=c(0,times),
parms=params
)
## return the Y variable only
## and discard Y(0)
out[-1,3]
}
Earlier, we used SSE as a measure of the discrepancy between model predictions and data:
sse <- function (params, data) {
times <- data$biweek/26 # convert to years
pred <- prediction(params,times)
discrep <- pred-data$measles
sum(discrep^2) # sum of squared errors
}
Now let’s use likelihood instead. Let’s suppose that, when we record cases, we make errors that are normally distributed. Here’s how we can compute the log likelihood of the data given the model and its parameters:
loglik <- function (params, data) {
times <- data$biweek/26
pred <- prediction(params,times)
sum(dnorm(x=data$measles,mean=pred,sd=params["sigma"],log=TRUE))
}
dat <- subset(niamey,community=="A")
params <- c(X.0=10000,Y.0=10,gamma=365/13,b=NA,sigma=1)
f <- function (b) {
par <- params
par["b"] <- b
loglik(par,dat)
}
b <- seq(from=0,to=0.02,by=0.0001)
ll <- sapply(b,f)
We plot the results:
plot(b,-ll,type='l',ylab=expression(-log(L)))
b.hat <- b[which.max(ll)]
abline(v=b.hat,lty=2)
The great similarity of this likelihood-based estimate to our earlier least-squares estimate is no accident. Why is this? Write the normal log likelihood in terms of the SSE to see the connection.
Suppose, alternatively, that the errors are log-normal with constant variance. Under what definition of SSE will least-squares and maximum likelihood give the same parameter estimates?
All this raises the question of what the best model for the errors really is. Of course, the answer will certainly depend on the nature of the data. The philosophy of likelihood encourages us to think about the question mechanistically. When the data are counts of reported cases, for example, a natural first choice is to suppose that each observation is Poisson-distributed, with mean equal to a reporting probability p times the model’s predicted prevalence:
poisson.loglik <- function (params, data) {
times <- data$biweek/26
pred <- prediction(params,times)
## 'prediction' already discards Y(0), so 'pred' lines up with the data
sum(dpois(x=data$measles,lambda=params["p"]*pred,log=TRUE))
}
Let’s see what the MLE parameters are for this model. We’ll start by estimating just one parameter, b, holding the others fixed. Now, we must have b > 0, so to keep the optimizer from straying into forbidden territory, we’ll estimate log b rather than b itself:
dat <- subset(niamey,community=="A")
params <- c(X.0=20000,Y.0=1,gamma=365/13,b=NA,p=0.2)
## objective function (-log(L))
f <- function (log.b) {
params[c("b")] <- exp(log.b) # un-transform 'b'
-poisson.loglik(params,dat)
}
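As a quick check that the objective function returns a finite value before we hand it to an optimizer (b = 0.005 is an arbitrary trial value):
f(log(0.005)) # -log(L) at a trial value of b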
For something new, we’ll use the mle2 function from the bbmle package to maximize the likelihood. mle2 employs an iterative algorithm, so to get started it needs an initial guess. It’s always a good idea to put a bit of thought into the guess; here we start from b ≈ 0.01 on the natural scale:
require(bbmle)
guess <- list(log.b=log(0.01))
fit0 <- mle2(f,start=guess)
fit0
##
## Call:
## mle2(minuslogl = f, start = guess)
##
## Coefficients:
## log.b
## -5.977186
##
## Log-likelihood: -1963.32
fit <- mle2(f,start=as.list(coef(fit0)))
fit
##
## Call:
## mle2(minuslogl = f, start = as.list(coef(fit0)))
##
## Coefficients:
## log.b
## -5.977186
##
## Log-likelihood: -1963.32
We can get an idea about the uncertainty and in particular obtain confidence intervals using the profile likelihood.
To profile over a parameter, we fix the value of that parameter at each
of several values, then maximize the likelihood over the remaining
unknown parameters. In bbmle, this is quite easy to obtain.
prof.b <- profile(fit)
plot(prof.b)
Now let’s try to estimate both b and the reporting probability p:
dat <- subset(niamey,community=="A")
params <- c(X.0=20000,Y.0=1,gamma=365/13,b=NA,p=NA)
logit <- function (p) log(p/(1-p)) # the logit transform
expit <- function (x) 1/(1+exp(-x)) # inverse logit
f <- function (log.b, logit.p) {
par <- params
par[c("b","p")] <- c(exp(log.b),expit(logit.p))
-poisson.loglik(par,dat)
}
guess <- list(log.b=log(0.005),logit.p=logit(0.2))
fit0 <- mle2(f,start=guess); fit0
##
## Call:
## mle2(minuslogl = f, start = guess)
##
## Coefficients:
## log.b logit.p
## -5.9918688 -0.1470357
##
## Log-likelihood: -394.2
fit <- mle2(f,start=as.list(coef(fit0))); fit
##
## Call:
## mle2(minuslogl = f, start = as.list(coef(fit0)))
##
## Coefficients:
## log.b logit.p
## -5.9918687 -0.1470357
##
## Log-likelihood: -394.2
## now untransform the parameters:
mle <- with(
as.list(coef(fit)),
c(
b=exp(log.b),
p=expit(logit.p)
)
)
mle
## b p
## 0.00249899 0.46330717
prof2 <- profile(fit)
plot(prof2)
We can also get confidence intervals:
ci <- confint(prof2)
ci
## 2.5 % 97.5 %
## log.b -5.9951952 -5.98854089
## logit.p -0.1953059 -0.09803465
ci[1,] <- exp(ci[1,])
ci[2,] <- expit(ci[2,])
rownames(ci) <- c("b","p")
ci
## 2.5 % 97.5 %
## b 0.002490691 0.00250732
## p 0.451328148 0.47551095
Let’s make a contour plot to visualize the likelihood surface.
dat <- subset(niamey,community=="A")
## this time the objective function has to
## take a vector argument
f <- function (pars) {
par <- params
par[c("b","p")] <- as.numeric(pars)
poisson.loglik(par,dat)
}
b <- seq(from=0.001,to=0.005,length=50)
p <- seq(0,1,length=50)
grid <- expand.grid(b=b,p=p)
grid$loglik <- apply(grid,1,f)
grid <- subset(grid,is.finite(loglik))
require(lattice)
contourplot(loglik~b+p,data=grid,cuts=20)
The scale over which the log likelihood is varying is clearly huge relative to what is meaningful. Let’s focus in on the region around the MLE.
b <- seq(from=0.00245,to=0.00255,length=50)
p <- seq(0.44,0.49,length=50)
grid <- expand.grid(b=b,p=p)
grid$loglik <- apply(grid,1,f)
grid <- subset(grid,is.finite(loglik))
require(lattice)
contourplot(loglik~b+p,data=grid,cuts=20)
Let’s look at the model’s predictions at the MLE. The model is a probability distribution, so we should look at a number of simulations. An important question is: are the data a plausible sample from the predicted probability distribution?
params[c("b","p")] <- mle
times <- c(dat$biweek/26)
model.pred <- prediction(params,times)
nsim <- 1000
simdat <- replicate(
n=nsim,
rpois(n=length(model.pred),
lambda=params["p"]*model.pred)
)
quants <- t(apply(simdat,1,quantile,probs=c(0.025,0.5,0.975)))
matplot(times,quants,col="blue",lty=c(1,2,1),type='l')
points(measles~times,data=dat,type='b',col='red')
Clearly the model is not doing a very good job of capturing the pattern in the data. It appears that we will need an error model with the potential for more variability than the Poisson allows: recall that, under the Poisson assumption, the variance of the error is equal to the mean, whereas these data appear considerably more variable than that. The negative binomial distribution is such a distribution. Let’s explore the alternative assumption that the observed counts are negative-binomially distributed, with mean p Y(t) and a dispersion parameter θ chosen so that the variance is mean + θ × mean²:
loglik <- function (params, data) {
times <- data$biweek/26
pred <- prediction(params,times)
## 'prediction' already discards Y(0), so 'pred' lines up with the data
sum(dnbinom(x=data$measles,
mu=params["p"]*pred,size=1/params["theta"],
log=TRUE))
}
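As a quick aside, we can verify by simulation that this size = 1/θ parameterization gives variance mu + θ mu², in contrast to the Poisson, whose variance equals its mean. The values mu = 50 and θ = 0.5 below are arbitrary, purely illustrative choices:
mu <- 50; theta <- 0.5 # arbitrary illustrative values
var(rpois(n=100000,lambda=mu)) # roughly 50: variance equals the mean
var(rnbinom(n=100000,mu=mu,size=1/theta)) # roughly 50+0.5*50^2 = 1300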
f <- function (log.b, logit.p, log.theta) {
par <- params
par[c("b","p","theta")] <- c(exp(log.b),
expit(logit.p),
exp(log.theta))
-loglik(par,dat)
}
guess <- list(log.b=log(params["b"]),
logit.p=logit(params["p"]),
log.theta=log(1))
fit0 <- mle2(f,start=guess)
fit <- mle2(f,start=as.list(coef(fit0)))
fit
##
## Call:
## mle2(minuslogl = f, start = as.list(coef(fit0)))
##
## Coefficients:
## log.b logit.p log.theta
## -5.933817 1.407372 -0.594814
##
## Log-likelihood: -102.46
prof3 <- profile(fit)
plot(prof3)
mle <- with(
as.list(coef(fit)),
c(
b=exp(log.b),
p=expit(logit.p),
theta=exp(log.theta)
)
)
params[c("b","p","theta")] <- mle
times <- c(dat$biweek/26)
model.pred <- prediction(params,times)
nsim <- 1000
simdat <- replicate(
n=nsim,
rnbinom(n=length(model.pred),
mu=params["p"]*model.pred,
size=1/params["theta"])
)
quants <- t(apply(simdat,1,quantile,probs=c(0.025,0.5,0.975)))
matplot(times,quants,col="blue",lty=c(1,2,1),type='l')
lines(times,simdat[,1],col='black')
points(measles~times,data=dat,type='b',col='red')
What does this plot tell us? Essentially, the deterministic SIR model, as we’ve written it, cannot capture the shape of the epidemic. In order to fit the data, the optimization algorithm has expanded the error variance, to the point of absurdity. The typical model realization (in black) does not much resemble the data.
Revisit the other communities of Niamey and/or the British boarding school influenza data using the Poisson model and bbmle
.
Try to estimate additional parameters (e.g., X.0 and Y.0) as well.
Reformulate the problem using the binomial error model. Modify the parameter estimation codes appropriately, estimate the parameters, and comment on the results.
Reformulate the problem using a normal error model in which the variance is proportional to the mean.
We’ve been treating the Niamey data as if they were direct, though inaccurate, measurements of the prevalence. Actually, these are incidence data: they count new infections rather than the number currently infected. It would be more appropriate to model these data by adding another equation that accumulates new infections and comparing the data with the increments of that accumulator over each biweek.
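One way to do this is sketched below (hypothetical code, not part of the lab’s workflow: it appends an accumulator variable C to the two-equation model used above, so that the expected number of new cases in each biweek is the increment of C over that biweek):
closed.sir.incidence <- function (t, x, params) {
## x = c(X, Y, C): susceptibles, infectives, cumulative incidence
inc <- params["b"]*x[1]*x[2] # instantaneous incidence rate
list(c(-inc,inc-params["gamma"]*x[2],inc))
}
## after integration, biweekly incidence is diff() of the cumulative-incidence column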
Anonymous. 1978. Influenza in a boarding school. British Medical Journal 1:587.
Grais, R. F., M. J. Ferrari, C. Dubray, O. N. Bjørnstad, B. T. Grenfell, A. Djibo, F. Fermon, and P. J. Guerin. 2006. Estimating transmission intensity for a measles epidemic in Niamey, Niger: lessons for intervention. Trans R Soc Trop Med Hyg 100:867–873.