 ## Summary and Analysis of Extension Program Evaluation in R

Salvatore S. Mangiafico

# Regression for Count Data

### Introduction

#### Count data

In general, common parametric tests like t-test and anova shouldn’t be used for count data.  One reason is technical in nature:  that parametric analyses require continuous data.  Count data is by its nature discrete and is left-censored at zero.  (That is, usually counts can’t be less than zero.)

A second reason is more practical in nature.  Count data are often highly skewed, and often produce skewed residuals if a parametric approach is attempted.  In this case, the hypothesis tests will not be accurate.

For further discussion, see the “Count data may not be appropriate for common parametric tests” section in the Introduction to Parametric Tests chapter.

#### Regression approaches for count data

The most common regression approach for handling count data is probably Poisson regression.  However, Poisson regression makes assumptions about the distribution of the data that may not be appropriate in all cases.  Hermite regression is a more flexible approach, but at the time of writing doesn’t have a complete set of support functions in R.  Quasi-Poisson regression is also flexible with data assumptions, but also but at the time of writing doesn’t have a complete set of support functions in R.  Negative binomial regression allows for overdispersion in data; and zero-inflated regression is useful when there are a high proportion of zero counts in the data.

#### Cautionary note

Note that model assumptions and pitfalls of these regression techniques are not discussed in depth here.  The reader is urged to understand the assumptions of this kind of modeling before proceeding.

#### Generalized linear regression

Poisson, Hermite, and related regression approaches are a type of generalized linear model.  This should not be confused with general linear model, which is implemented with the lm function.  Generalized linear models are implemented with the glm function or other functions.

Generalized linear models are used when the dependent variable is count, binary, multinomial, etc.  More information on using the glm function can be found by using help(glm) and help(family).  For examples of logistic regression, see the chapter Models for Nominal Data; the chapter Beta Regression for Percent and Proportion Data; or Mangiafico (2015) in the “References” section.  For a table of common uses for family and link function in generalized linear models, see the Wikipedia article in the “References” section for this chapter.

### Packages used in this chapter

The packages used in this chapter include:

•  pysch

•  hermite

•  lattice

•  plyr

•  boot

•  DescTools

•  ggplot2

•  car

•  multcompView

•  emmeans

•  MASS

•  pscl

•  rcompanion

•  robust

The following commands will install these packages if they are not already installed:

if(!require(psych)){install.packages("psych")}
if(!require(hermite)){install.packages("hermite")}
if(!require(lattice)){install.packages("lattice")}
if(!require(plyr)){install.packages("plyr")}
if(!require(boot)){install.packages("boot")}
if(!require(DescTools)){install.packages("DescTools")}
if(!require(ggplot2)){install.packages("ggplot2")}
if(!require(car)){install.packages("car")}
if(!require(multcompView)){install.packages("multcompView")}
if(!require(emmeans)){install.packages("emmeans")}
if(!require(MASS)){install.packages("MASS")}
if(!require(pscl)){install.packages("pscl")}
if(!require(rcompanion)){install.packages("rcompanion")}
if(!require(robust)){install.packages("robust")}

### Count data example

In this example, extension researchers have set up garden plots with different suites of plants, with each suite identified as a level of the variable Garden below.  In September, they counted the number of monarch butterflies in each garden plot.

Input = ("
Garden   Monarchs
A        0
A        4
A        2
A        2
A        0
A        6
A        0
A        0
B        5
B        9
B        7
B        5
B        7
B        5
B        9
B        5
C       10
C       14
C       12
C       12
C       10
C       16
C       10
C       10
")

###  Order factors by the order in data frame

###  Otherwise, R will alphabetize them

Data\$Garden = factor(Data\$Garden,
levels=unique(Data\$Garden))

###  Check the data frame

library(psych)

str(Data)

summary(Data)

### Remove unnecessary objects

rm(Input)

#### Histograms

library(lattice)

histogram(~ Monarchs | Garden,
data=Data,
layout=c(1,3)      #  columns and rows of individual plots
) ### Poisson regression example

Poisson regression makes certain assumptions about the relationship between the mean and the dispersion of the dependent variable.  Because this assumption may not be met for all data sets, Poisson regression may not be recommended for routine use.  Particularly, classic Poisson regression should be avoided if there is overdispersion in the data or if there are several zero counts in the dependent variable.

An alternate approach for data with overdispersion is negative binomial regression.

An alternative approach for data with many zeros is zero-inflated Poisson regression.

For further discussion, see the “Count data may not be appropriate for common parametric tests” section in the Introduction to Parametric Tests chapter.

Note that model assumptions and pitfalls of this approach are not discussed here.  The reader is urged to understand the assumptions of this kind of modeling before proceeding.

model.p = glm(Monarchs ~ Garden,
data=Data,
family="poisson")

library(car)

Anova(model.p,
type="II",
test="LR")

Analysis of Deviance Table (Type II tests)

LR Chisq Df Pr(>Chisq)
Garden   66.463  2  3.697e-15 ***

library(rcompanion)

nagelkerke(model.p)

\$Pseudo.R.squared.for.model.vs.null
Pseudo.R.squared
Cox and Snell (ML)                   0.937293
Nagelkerke (Cragg and Uhler)         0.938037

\$Likelihood.ratio.test
Df.diff LogLik.diff  Chisq    p.value
-2     -33.231 66.463 3.6967e-15

library(multcompView)

library(emmeans)

marginal = emmeans(model.p,
~ Garden)

pairs(marginal,

cld(marginal,
alpha=0.05,
Letters=letters,  ### Use lower-case letters for .group

Garden     emmean        SE df   asymp.LCL asymp.UCL .group
A      0.5596158 0.2672450 NA -0.07849527  1.197727  a
B      1.8718022 0.1386750 NA  1.54068251  2.202922   b
C      2.4638532 0.1031421 NA  2.21757688  2.710130    c

Results are given on the log (not the response) scale.
Confidence level used: 0.95
Conf-level adjustment: sidak method for 3 estimates
P value adjustment: tukey method for comparing a family of 3 estimates
Tests are performed on the log scale
significance level used: alpha = 0.05

### Note that estimates are on log scale

### Negative binomial regression example

Negative binomial regression is similar in application to Poisson regression, but allows for overdispersion in the dependent count variable.

This example will use the glm.nb function in the MASS package.  The Anova function in the car package will be used for an analysis of deviance, and the nagelkerke function will be used to determine a p-value and pseudo R-squared value for the model.  Post-hoc analysis can be conducted with the emmeans package.

Note that model assumptions and pitfalls of this approach are not discussed here.  The reader is urged to understand the assumptions of this kind of modeling before proceeding.

library(MASS)

model.nb = glm.nb(Monarchs ~ Garden,
data=Data,
control = glm.control(maxit=10000))

library(car)

Anova(model.nb,
type="II",
test="LR")

Analysis of Deviance Table (Type II tests)

LR Chisq Df Pr(>Chisq)
Garden   66.464  2  3.694e-15 ***

library(rcompanion)

nagelkerke(model.nb)

\$Pseudo.R.squared.for.model.vs.null
Pseudo.R.squared
Cox and Snell (ML)                   0.776007
Nagelkerke (Cragg and Uhler)         0.778217

\$Likelihood.ratio.test
Df.diff LogLik.diff  Chisq    p.value
-2     -17.954 35.907 1.5952e-08

library(multcompView)

library(emmeans)

marginal = emmeans(model.nb,
~ Garden)

pairs(marginal,

cld(marginal,
alpha   = 0.05,
Letters = letters,    ### Use lower-case letters for .group
type    = "response", ### Report emmeans in orginal scale

Garden response        SE df asymp.LCL asymp.UCL .group
A          1.75 0.4677072 NA 0.9244706  3.312707  a
B          6.50 0.9013878 NA 4.6677750  9.051422   b
C         11.75 1.2119200 NA 9.1850474 15.031223    c

Confidence level used: 0.95
Conf-level adjustment: sidak method for 3 estimates
Intervals are back-transformed from the log scale
P value adjustment: tukey method for comparing a family of 3 estimates
Tests are performed on the log scale
significance level used: alpha = 0.05

### Zero-inflated regression example

Zero-inflated regression is similar in application to Poisson regression, but allows for an abundance of zeros in the dependent count variable.

This example will use the zeroinfl function in the pscl package.  The Anova function in the car package will be used for an analysis of deviance, and the nagelkerke function will be used to determine a p-value and pseudo R-squared value for the model.  Post-hoc analysis can be conducted with the emmeans package.

library(pscl)

model.zi = zeroinfl(Monarchs ~ Garden,
data = Data,
dist = "poisson")

### dist = "negbin" may be used

summary(model.zi)

Call:
zeroinfl(formula = Monarchs ~ Garden | Garden, data = Data, dist = "poisson")

Count model coefficients (poisson with log link):
Estimate Std. Error z value Pr(>|z|)
(Intercept)   1.2182     0.2847   4.278 1.89e-05 ***
GardenB       0.6536     0.3167   2.064    0.039 *
GardenC       1.2457     0.3029   4.113 3.90e-05 ***

Zero-inflation model coefficients (binomial with logit link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) -7.046e-02  7.363e-01  -0.096    0.924
GardenB     -2.057e+01  1.071e+04  -0.002    0.998
GardenC     -2.057e+01  1.071e+04  -0.002    0.998

### Note that there are separate coefficients for the
###  Poisson part of the analysis and for the zero-inflation part.

library(car)

Anova(model.zi,
type="II",
test="Chisq")

Analysis of Deviance Table (Type II tests)

Df  Chisq Pr(>Chisq)
Garden  2 23.914  6.414e-06 ***

library(rcompanion)

nagelkerke(model.zi)

\$Pseudo.R.squared.for.model.vs.null
Pseudo.R.squared
Cox and Snell (ML)                   0.797356
Nagelkerke (Cragg and Uhler)         0.800291

\$Likelihood.ratio.test
Df.diff LogLik.diff  Chisq    p.value
-4     -19.156 38.311 9.6649e-08

library(multcompView)

library(emmeans)

marginal = emmeans(model.zi,
~ Garden)

pairs(marginal,

cld(marginal,
alpha=0.05,
Letters=letters,  ### Use lower-case letters for .group

Garden  emmean        SE df   asymp.LCL asymp.UCL .group
A        1.75 0.7586301 NA -0.06140972  3.561410  a
B        6.50 0.9013877 NA  4.34772229  8.652278   b
C       11.75 1.2119199 NA  8.85625304 14.643747    c

Confidence level used: 0.95
Conf-level adjustment: sidak method for 3 estimates
P value adjustment: tukey method for comparing a family of 3 estimates
significance level used: alpha = 0.05

### Note, emmeans are on the original measurement scale

### Robust Poisson regression example

Robust Poisson regression is robust to outliers in the dependent variable.

This example uses the glmRob function in the robust package.  The anova function can be used to conduct an analysis of deviance.  The p-value for the model can be found by comparing the model to a null model. However, at the time of writing, I don’t know of any way to determine AIC or pseudo R-squared for the model.

At the time of writing, the glmRob function can only use the Poisson and binomial families of models.

An alternate method is the glmrob function in the robustbase package.

library(robust)

model.rob = glmRob(Monarchs ~ Garden,
data = Data,
family = "poisson")

anova(model.rob, test="Chisq")

Df Deviance Resid. Df Resid.
Dev     Pr(>Chi)
NULL   NA       NA        23  430.19850           NA
Garden  2 400.9221        21   29.27641 3.567693e-63

model.rob.null = glmRob(Monarchs ~ 1,
data = Data,
family = "poisson")

anova(model.rob.null, model.rob, test="Chisq")

Terms Resid. Df Resid. Dev Test Df Deviance     Pr(>Chi)
1      1        23   95.12606      NA       NA           NA
2 Garden        21   29.27641       2 65.84965 5.536815e-11

### Quasi-Poisson regression

Quasi-Poisson regression is useful since it has a variable dispersion parameter, so that it can model over-dispersed data.   It may be better than negative binomial regression in some circumstances (Verhoef and Boveng. 2007).

At the time of writing, Quasi-Poisson regression doesn’t have complete set of support functions in R.   Using the quasipoisson family option in the glm function, the results will have the same parameter coefficients as with the poisson option, but the inference statistics are adjusted in the summary function.  The Anova function in the car package can be used for an analysis of deviance table, and the emmeans package can be used for post-hoc comparisons.  Since the model doesn’t produce a log-likelihood value, I don’t know a way to produce a p-value for the mode, for a pseudo R-squared value for the model.

.

model.qp = glm(Monarchs ~ Garden,
data=Data,
family="quasipoisson")

library(car)

Anova(model.qp,
type="II",
test="LR")

Analysis of Deviance Table (Type II tests)

Response: Monarchs
LR Chisq Df Pr(>Chisq)
Garden   52.286  2  4.429e-12 ***

library(multcompView)

library(emmeans)

marginal = emmeans(model.qp,
~ Garden)

pairs(marginal,

cld(marginal,
alpha=0.05,
Letters=letters,

Garden    emmean        SE df  asymp.LCL asymp.UCL .group
A      0.5596158 0.3013057 NA -0.1598233  1.279055  a
B      1.8718022 0.1563493 NA  1.4984809  2.245123   b
C      2.4638532 0.1162877 NA  2.1861887  2.741518    c

Results are given on the log (not the response) scale.
Confidence level used: 0.95
Conf-level adjustment: sidak method for 3 estimates
P value adjustment: tukey method for comparing a family of 3 estimates
significance level used: alpha = 0.05

### Hermite regression

The generalized Hermite distribution is a more general distribution that can handle overdispersion or multimodality (Moriña and others, 2015).  This makes generalized Hermite regression a powerful and flexible tool for modeling count data.  It is implemented with the hermite package.

Fitting models with the hermite package can be somewhat difficult.  One issue is that model fitting may fail without some parameters being specified.  Often specifying an appropriate value for the m option will help.

A further difficulty with this approach is that, at the time writing, the package isn’t supported by the anova function to compare models, the Anova function to test effects, or other useful functions like emmeans for factor effects.

The hermite package is used to conduct hermite regression.  Here, the m=3 option is specified.  Often the default m=NULL can be used.  In this case, if the m value is not specified, the function cannot complete the model fitting, and errors are produced.  Using m=2 often works.  Here, m=3 was used because it produced a model with a lower AIC than did the m=2 option.

library(hermite)

model = glm.hermite(Monarchs ~ Garden,
data = Data,
m=3)

summary(model)

Coefficients:
Estimate Std. Error   z value      p-value
(Intercept)      0.5081083  0.3251349 1.5627612 1.181088e-01
GardenB          1.3700567  0.3641379 3.7624662 1.682461e-04
GardenC          1.9596153  0.3476326 5.6370291 1.730089e-08
dispersion.index 1.0820807  0.2877977 0.1281707 3.601681e-01
order            3.0000000         NA        NA           NA
(Likelihood ratio test against Poisson is reported by *z value* for *dispersion.index*)

AIC:  112.7762

#### Post-hoc analysis: Medians and confidence intervals

At the time of writing, the emmeans package does not support post-hoc analysis of regressions produced with the hermite package.

One imperfect approach for post-hoc analysis would be to examine median counts for treatments and the confidence intervals of these medians.  We can conclude that groups with non-overlapping 95% confidence intervals for their medians are significantly different.

However, this approach does not represent any information learned from the Hermite regression.

A second issue is that, because the dependent variable is not continuous, the distribution of the bootstrapped confidence intervals is not likely to be continuous, and so is may not be reliable.

To get confidence intervals for the medians for each group, we will use the groupwiseMedian function.  Here I used the percentile method for confidence intervals.

library(rcompanion)

Sum = groupwiseMedian(Monarchs ~ Garden,
data=Data,
conf=0.95,
R=5000,
percentile=TRUE,
bca=FALSE,
digits=3)

Sum

Garden n Median Conf.level Percentile.lower Percentile.upper
1      A 8      1       0.95                0                4
2      B 8      6       0.95                5                8
3      C 8     11       0.95               10               14

###  In this case, none of the confidence intervals overlap.

##### Plot of medians and confidence intervals

The data frame Sum created above will be passed to ggplot for plotting.  At the end of the code, annotate is used to add text to the plot to indicate which medians are significantly different from one another.

library(ggplot2)

ggplot(Sum,                ### The data frame to use.
aes(x = Garden,
y = Median)) +
geom_errorbar(aes(ymin = Percentile.lower,
ymax = Percentile.upper),
width = 0.05,
size  = 1) +
geom_point(shape = 15,
size  = 5) +
theme_bw() +
theme(axis.title   = element_text(face  = "bold")) +
ylab("Median count of monarch butterflies") +

annotate("text",
x = 1:3,
y = c(5, 10, 15),
label = c("Group 3", "Group 2", "Group 1")) ### Optional code for chi-square goodness-of-fit test

An alternative approach to handling count data is to sum up the counts for treatments, and use a chi-square test or related test.  Here, a chi-square goodness-of-fit test is used to see if counts differ from “expected” equal proportions.

#### Omnibus test

Tabla = xtabs(Monarchs ~ Garden,
data = Data)

Tabla

Garden
A  B  C
14 52 94

chisq.test(Tabla)

Chi-squared test for given probabilities

X-squared = 60.05, df = 2, p-value = 9.127e-14

#### Post-hoc chi-square tests

Garden.A = sum(Data\$Monarchs[Data\$Garden=="A"])
Garden.B = sum(Data\$Monarchs[Data\$Garden=="B"])
Garden.C = sum(Data\$Monarchs[Data\$Garden=="C"])

observed = c(Garden.A, Garden.B)   # observed frequencies
expected = c(1/2, 1/2)             # expected proportions

chisq.test(x = observed,
p = expected)

Chi-squared test for given probabilities

X-squared = 21.879, df = 1, p-value = 2.904e-06

observed = c(Garden.B, Garden.C)   # observed frequencies
expected = c(1/2, 1/2)             # expected proportions

chisq.test(x = observed,
p = expected)

Chi-squared test for given probabilities

X-squared = 12.082, df = 1, p-value = 0.0005091

observed = c(Garden.A, Garden.C)   # observed frequencies
expected = c(1/2, 1/2)             # expected proportions

chisq.test(x = observed,
p = expected)

Chi-squared test for given probabilities

X-squared = 59.259, df = 1, p-value = 1.382e-14

### Optional analysis: Vuong test to compare Poisson, negative binomial, and zero-inflated models

The Vuong test, implemented by the pscl package, can test two non-nested models.  It works with negbin, zeroinfl, and some glm model objects which are fitted to the same data.

The null hypothesis is that there is no difference in models.  The function produces three tests, a “Raw” test, an AIC-corrected, and a BIC-corrected, any of which could be used.

It has been suggested that the Vuong test not be used to test for zero-inflation (Wilson, 2015).

Define models

model.p = glm(Monarchs ~ Garden,
data=Data,
family="poisson")

library(MASS)

model.nb = glm.nb(Monarchs ~ Garden,
data=Data,
control = glm.control(maxit=10000))

library(pscl)

model.zi = zeroinfl(Monarchs ~ Garden,
data = Data,
dist = "poisson")

Vuong test

library(pscl)

vuong(model.p,
model.nb,
digits = 4)

Vuong Non-Nested Hypothesis Test-Statistic:
(test-statistic is asymptotically distributed N(0,1) under the
null that the models are indistinguishible)
-------------------------------------------------------------
Vuong z-statistic             H_A p-value
Raw                  0.03324988 model1 > model2 0.48674
AIC-corrected        0.03324988 model1 > model2 0.48674
BIC-corrected        0.03324988 model1 > model2 0.48674

### Positive Vuong z-statistic suggests that model 1 is superior,
###   but, in this case, the difference is not significant,
###   and the value of the statistic is probably too tiny to be meaningful.

vuong(model.p,
model.zi,
digits = 4)

Vuong Non-Nested Hypothesis Test-Statistic:
(test-statistic is asymptotically distributed N(0,1) under the
null that the models are indistinguishible)
-------------------------------------------------------------
Vuong z-statistic             H_A  p-value
Raw                  -1.4424725 model2 > model1 0.074585
AIC-corrected        -0.4335210 model2 > model1 0.332318
BIC-corrected         0.1607786 model1 > model2 0.436134

### Negative Vuong z-statistic suggests that model 2 is superior.
### If the Raw statistic is used, p = 0.07 gives some evidence
###   that zi model is superior.

vuong(model.nb,
model.zi,
digits = 4)

Vuong Non-Nested Hypothesis Test-Statistic:
(test-statistic is asymptotically distributed N(0,1) under the
null that the models are indistinguishible)
-------------------------------------------------------------
Vuong z-statistic             H_A  p-value
Raw                  -1.4424725 model2 > model1 0.074585
AIC-corrected        -0.4335210 model2 > model1 0.332318
BIC-corrected         0.1607786 model1 > model2 0.436134

### Negative Vuong z-statistic suggests that model 2 is superior.
### If the Raw statistic is used, p = 0.07 gives some evidence
###   that zi model is superior.

### References

Moriña, D., M. Higueras, P. Puig, and M. Oliveira. 2015. Generalized Hermite Distribution

Modelling with the R Package hermite. The R Journal 7(2):263–274. journal.r-project.org/archive/2015-2/morina-higueras-puig-etal.pdf.

help(package="hermite")

library(hermite); ?glm.hermite

library(MASS); ?glm.nb

library(pscl); ?zeroinfl

library(pscl); ?vuong

“Simple Logistic Regression” in Mangiafico, S.S. 2015. An R Companion for the Handbook of Biological Statistics, version 1.09. rcompanion.org/rcompanion/e_06.html.

"Generalized linear model: Link function". No date. Wikipedia. Retrieved 31 Jan. 2016. en.wikipedia.org/wiki/Generalized_linear_model#Link_function.

Verhoef, J.M. and P.L. Boveng. 2007. Quasi-Poisson vs. negative binomial regression: How should we model overdispersed count data? Ecology 88(11) 2766–2772. http://fisher.utstat.toronto.edu/reid/sta2201s/QUASI-POISSON.pdf.

Wilson, P. 2015. The Misuse of the Vuong Test for Non-Nested Models to Test for Zero-Inflation. Economic Letters 127: 51–53. cybermetrics.wlv.ac.uk/paperdata/misusevuong.pdf

### References for count data

Grace-Martin, K. No date. "Regression Models for Count Data". The Analysis Factor. www.theanalysisfactor.com/regression-models-for-count-data/.

Grace-Martin, K. No date. " Zero-Inflated Poisson Models for Count Outcomes". The Analysis Factor. www.theanalysisfactor.com/zero-inflated-poisson-models-for-count-outcomes/.

[IDRE] Institute for Digital Research and Education. 2015. “R Data Analysis Examples: Poisson Regression”. UCLA.  www.ats.ucla.edu/stat/r/dae/poissonreg.htm.

[IDRE] Institute for Digital Research and Education. 2015. “R Data Analysis Examples: Negative Binomial Regression”.

[IDRE] Institute for Digital Research and Education. 2015. “R Data Analysis Examples: Zero-Inflated Poisson Regression”.

[IDRE] Institute for Digital Research and Education. 2015. “R Data Analysis Examples: Zero-Truncated Poisson Regression”.