hurdle {pscl}    R Documentation

Hurdle Models for Count Data Regression

Description

Fit hurdle regression models for count data via maximum likelihood.

Usage

hurdle(formula, data, subset, na.action, weights, offset,
  dist = c("poisson", "negbin", "geometric"),
  zero.dist = c("binomial", "poisson", "negbin", "geometric"),
  link = c("logit", "probit", "cloglog", "cauchit", "log"),
  control = hurdle.control(...),
  model = TRUE, y = TRUE, x = FALSE, ...)

Arguments

formula symbolic description of the model, see details.
data, subset, na.action arguments controlling formula processing via model.frame.
weights optional numeric vector of weights.
offset optional numeric vector with an a priori known component to be included in the linear predictor of the count model.
dist character specification of count model family.
zero.dist character specification of the zero hurdle model family.
link character specification of link function in the binomial zero hurdle (only used if zero.dist = "binomial").
control a list of control arguments specified via hurdle.control.
model, y, x logicals. If TRUE the corresponding components of the fit (model frame, response, model matrix) are returned.
... arguments passed to hurdle.control in the default setup.

Details

Hurdle count models are two-component models with a truncated count component for positive counts and a hurdle component that models the zero counts. Thus, unlike zero-inflation models, there are not two sources of zeros: the count model is only employed if the hurdle for modeling the occurrence of zeros is exceeded. The count model is typically a truncated Poisson or negative binomial regression (with log link). The geometric distribution is a special case of the negative binomial with size parameter equal to 1. For modeling the hurdle (occurrence of positive counts), either a binomial model or a censored count distribution can be employed. Binomial logit and censored geometric models as the hurdle part both lead to the same likelihood function and thus to the same coefficient estimates.
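The two-component structure described above can be sketched in base R. This is an illustrative sketch only, not code taken from pscl: the function name dhurdle_pois and its arguments p_pos and lambda are hypothetical. It assumes a Poisson count component and checks numerically that a censored geometric hurdle with log link gives the same hurdle probability as a binomial logit.

```r
# Hurdle-Poisson probability mass function (hypothetical helper, not part of pscl).
# p_pos:  probability of crossing the hurdle, P(Y > 0)
# lambda: mean of the untruncated Poisson count component
dhurdle_pois <- function(y, p_pos, lambda) {
  # zero-truncated Poisson: f(y) / (1 - f(0)) for y >= 1
  trunc <- dpois(y, lambda) / (1 - dpois(0, lambda))
  ifelse(y == 0, 1 - p_pos, p_pos * trunc)
}

# The probabilities sum to 1 (up to a negligible tail beyond 100)
probs <- dhurdle_pois(0:100, p_pos = 0.7, lambda = 2)
sum(probs)

# Censored geometric hurdle with log link versus binomial logit:
# P(Y > 0) = 1 - 1 / (1 + exp(eta)) = plogis(eta)
eta <- c(-2, 0, 1.5)
p_geom <- 1 - dnbinom(0, size = 1, mu = exp(eta))
all.equal(p_geom, plogis(eta))
```

Because all zeros come from the hurdle component alone, the zero probability here is simply 1 - p_pos, regardless of lambda.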

The formula can be used to specify both components of the model: If a formula of type y ~ x1 + x2 is supplied, then the same regressors are employed in both components. This is equivalent to y ~ x1 + x2 | x1 + x2. Of course, a different set of regressors could be specified for the zero hurdle component, e.g., y ~ x1 + x2 | z1 + z2 + z3 giving the count data model y ~ x1 + x2 conditional on (|) the zero hurdle model y ~ z1 + z2 + z3.
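As a minimal sketch of the two formula styles, assuming the pscl package is installed and using its bioChemists data: the object names fm_same, fm_expl and fm_diff and the particular choice of regressors are illustrative only.

```r
library(pscl)
data("bioChemists", package = "pscl")

# One-part formula: the same regressors enter both components
fm_same <- hurdle(art ~ fem + ment, data = bioChemists)

# Equivalent two-part formula with the regressors written out twice
fm_expl <- hurdle(art ~ fem + ment | fem + ment, data = bioChemists)
all.equal(coef(fm_same), coef(fm_expl))

# Different regressors for the zero hurdle component (after the |)
fm_diff <- hurdle(art ~ fem + ment | kid5 + phd, data = bioChemists)
names(coef(fm_diff))
```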

All parameters are estimated by maximum likelihood using optim, with control options set in hurdle.control. Starting values can be supplied; otherwise they are estimated by glm.fit (the default). By default, the two components of the model are estimated separately using two optim calls. Standard errors are derived numerically using the Hessian matrix returned by optim. See hurdle.control for details.
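The following sketch compares the default separate estimation with joint estimation of both components in a single optim call. It assumes that hurdle.control accepts a logical separate argument, as described on its own help page; check ?hurdle.control for the arguments available in your installed version. The object names are illustrative.

```r
library(pscl)
data("bioChemists", package = "pscl")

# Default: count and zero hurdle components are fitted in two optim calls
fm_sep <- hurdle(art ~ ., data = bioChemists)

# Assumption: separate = FALSE requests a single joint optim call
fm_joint <- hurdle(art ~ ., data = bioChemists,
                   control = hurdle.control(separate = FALSE))

# The two fits should agree up to optimizer tolerance
max(abs(coef(fm_sep) - coef(fm_joint)))
```

The likelihood of a hurdle model factors into a zero hurdle part and a count part, which is why separate estimation is expected to reach the same optimum as joint estimation.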

The returned fitted model object is of class "hurdle" and is similar to fitted "glm" objects. For elements such as "coefficients" or "terms" a list is returned with elements for the zero and count components, respectively. For details see below.

A set of standard extractor functions for fitted model objects is available for objects of class "hurdle", including methods for the generic functions print, summary, coef, vcov, logLik, residuals, predict, fitted, terms, and model.matrix. See predict.hurdle for more details on all methods.
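A few of these extractors can be tried as follows. This is a minimal sketch assuming pscl is installed; the object name fm is illustrative, and type = "response" is assumed to be an accepted value of the type argument of predict (see predict.hurdle).

```r
library(pscl)
data("bioChemists", package = "pscl")

fm <- hurdle(art ~ ., data = bioChemists)

coef(fm, model = "count")  # coefficients of the truncated count component
coef(fm, model = "zero")   # coefficients of the zero hurdle component
logLik(fm)                 # log-likelihood of the full model
dim(vcov(fm))              # covariance matrix of all coefficients

# Fitted means, obtained in two ways
head(fitted(fm))
head(predict(fm, type = "response"))
```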

Value

An object of class "hurdle", i.e., a list with components including

coefficients a list with elements "count" and "zero" containing the coefficients from the respective models,
residuals a vector of raw residuals (observed - fitted),
fitted.values a vector of fitted means,
optim a list (of lists) with the output(s) from the optim call(s) for minimizing the negative log-likelihood(s),
control the control arguments passed to the optim call,
start the starting values for the parameters passed to the optim call(s),
weights the case weights used,
offset the offset vector used (if any),
n number of observations,
df.null residual degrees of freedom for the null model (= n - 2),
df.residual residual degrees of freedom for fitted model,
terms a list with elements "count", "zero" and "full" containing the terms objects for the respective models,
theta estimate of the additional theta parameter of the negative binomial model(s) (if negative binomial component is used),
SE.logtheta standard error(s) for log(theta),
loglik log-likelihood of the fitted model,
vcov covariance matrix of all coefficients in the model (derived from the Hessian of the optim output(s)),
dist a list with elements "count" and "zero" with character strings describing the respective distributions used,
link character string describing the link if a binomial zero hurdle model is used,
linkinv the inverse link function corresponding to link,
converged logical indicating successful convergence of optim,
call the original function call,
formula the original formula,
levels levels of the categorical regressors,
contrasts a list with elements "count" and "zero" containing the contrasts corresponding to levels from the respective models,
model the full model frame (if model = TRUE),
y the response count vector (if y = TRUE),
x a list with elements "count" and "zero" containing the model matrices from the respective models (if x = TRUE).
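Since the returned object is a list, the components listed above can also be inspected directly. A minimal sketch, assuming pscl is installed and using an illustrative object name fm:

```r
library(pscl)
data("bioChemists", package = "pscl")

fm <- hurdle(art ~ ., data = bioChemists)

names(fm$coefficients)  # elements for the count and zero components
fm$dist                 # distributions used in each component
fm$link                 # link of the binomial zero hurdle model
fm$converged            # did optim converge?
fm$n                    # number of observations
fm$df.null              # n - 2
fm$df.residual          # residual degrees of freedom of the fitted model
```

For everyday use the extractor functions (coef, vcov, logLik and so on) are usually preferable to direct list access.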

Author(s)

Achim Zeileis <Achim.Zeileis@R-project.org>

References

Cameron, A. Colin and Pravin K. Trivedi. 1998. Regression Analysis of Count Data. New York: Cambridge University Press.

Cameron, A. Colin and Pravin K. Trivedi. 2005. Microeconometrics: Methods and Applications. Cambridge: Cambridge University Press.

Mullahy, J. 1986. Specification and Testing of Some Modified Count Data Models. Journal of Econometrics. 33:341–365.

Zeileis, Achim, Christian Kleiber and Simon Jackman. 2008. “Regression Models for Count Data in R.” Journal of Statistical Software, 27(8). URL http://www.jstatsoft.org/v27/i08/.

See Also

hurdle.control, glm, glm.fit, glm.nb, zeroinfl

Examples

## data
data("bioChemists", package = "pscl")

## logit-poisson
## "art ~ ." is the same as "art ~ . | .", i.e.
## "art ~ fem + mar + kid5 + phd + ment | fem + mar + kid5 + phd + ment"
fm_hp1 <- hurdle(art ~ ., data = bioChemists)
summary(fm_hp1)

## geometric-poisson
fm_hp2 <- hurdle(art ~ ., data = bioChemists, zero.dist = "geometric")
summary(fm_hp2)

## logit and geometric models are equivalent (differences are numerically zero)
coef(fm_hp1, model = "zero") - coef(fm_hp2, model = "zero")

## logit-negbin
fm_hnb1 <- hurdle(art ~ ., data = bioChemists, dist = "negbin")
summary(fm_hnb1)

## negbin-negbin
fm_hnb2 <- hurdle(art ~ ., data = bioChemists, dist = "negbin", zero.dist = "negbin")
summary(fm_hnb2)

[Package pscl version 1.03 Index]