Packages used in this chapter
The packages used in this chapter include:
• car
• multcompView
• lsmeans
• VGAM
• lmtest
• psych
• lme4
• MASS
• rcompanion
The following commands will install these packages if they are not already installed:
if(!require(car)){install.packages("car")}
if(!require(multcompView)){install.packages("multcompView")}
if(!require(lsmeans)){install.packages("lsmeans")}
if(!require(VGAM)){install.packages("VGAM")}
if(!require(lmtest)){install.packages("lmtest")}
if(!require(psych)){install.packages("psych")}
if(!require(lme4)){install.packages("lme4")}
if(!require(MASS)){install.packages("MASS")}
if(!require(rcompanion)){install.packages("rcompanion")}
Log-linear models for tests of association
Analysis of contingency tables with two or more dimensions can be accomplished with log-linear models. These will test for association. When there are multiple dimensions, this approach is sometimes called multiway frequency analysis. It is analogous to chi-square tests for two-way tables or Cochran–Mantel–Haenszel test for three way tables.
One advantage of log-linear models is that they handle tables with a higher number of dimensions.
Another advantage is that the models can be specified to test for more complicated patterns of association. When each term in the model is included, and there are no interactions, the model tests for mutual independence of the terms. Models can also test for conditional independence and partial independence.
For more information on specifying models for conditional independence or partial independence, see:
Quick-R. 2014. Frequencies and Crosstabs. www.statmethods.net/stats/frequencies.html.
Agresti, A. 2007. An Introduction to Categorical Data Analysis 2nd Edition. Wiley-Interscience.
For a more complete discussion of log-linear models for multiway frequency analysis see:
Tabachnick, B.G. and S.F. Fidell. 2001. Using multivariate statistics. 4th Edition. Allyn and Bacon, Boston, MA.
Zero frequencies in cells may cause the maximum likelihood estimation to fail, or may cause bias in the results.
Structural zeros can be handled with the start argument in the loglm function. See library(MASS); ?loglm , library(MASS); ?loglm1 , and ?loglin .
Log-linear model example
The following example revisits Alexander Anderson’s data of passing grades by sex within counties, for which we had used the Cochran–Mantel–Haenszel test. Here we use a log-linear model with the loglm function in the MASS package.
Input = ("
County Sex Result Count
Bloom Female Pass 9
Bloom Female Fail 5
Bloom Male Pass 7
Bloom Male Fail 17
Cobblestone Female Pass 11
Cobblestone Female Fail 4
Cobblestone Male Pass 9
Cobblestone Male Fail 21
Dougal Female Pass 9
Dougal Female Fail 7
Dougal Male Pass 19
Dougal Male Fail 9
Heimlich Female Pass 15
Heimlich Female Fail 8
Heimlich Male Pass 14
Heimlich Male Fail 17
")
Data = read.table(textConnection(Input),header=TRUE)
### Order factors otherwise R will alphabetize them
Data$County = factor(Data$County,
levels=unique(Data$County))
Data$Sex = factor(Data$Sex,
levels=unique(Data$Sex))
Data$Result = factor(Data$Result,
levels=unique(Data$Result))
### Check the data frame
library(psych)
headTail(Data)
str(Data)
summary(Data)
### Remove unnecessary objects
rm(Input)
Create table
Table = xtabs(Count ~ Sex + Result + County,
data=Data)
ftable(Table) # Display a
flattened table
County Bloom Cobblestone Dougal Heimlich
Sex Result
Female Pass 9 11 9 15
Fail 5 4 7 8
Male Pass 7 9 19 14
Fail 17 21 9 17
Log-linear model
The model here tests for mutual independence of the three variables.
library(MASS)
loglm( ~ Sex + Result + County,
Table)
Statistics:
X^2 df P(> X^2)
Likelihood Ratio 20.96662 10 0.0213275
Pearson 20.79050 10 0.0226027
Post-hoc analysis
For a post-hoc analysis, you could slice up a multi-dimensional table in any way that makes sense for the hypotheses you want to test. Here, we’ll look for an association of Sex and Result within each county.
Third dimension in our table is County, so Table[,,1] yields the piece of the table where the third dimension is equal to 1, that is, where County = Bloom. And so on.
Bloom = Table[,,1]
loglm( ~ Sex + Result,
Bloom)
X^2 df P(> X^2)
Likelihood Ratio 4.504069 1 0.03381429
Pearson 4.473688 1 0.03442062
Cobblestone = Table[,,2]
loglm( ~ Sex + Result,
Cobblestone)
X^2 df P(> X^2)
Likelihood Ratio 7.777229 1 0.005290890
Pearson 7.605000 1 0.005820666
Dougal = Table[,,3]
loglm( ~ Sex + Result,
Dougal)
X^2 df P(> X^2)
Likelihood Ratio 0.5876125 1 0.4433438
Pearson 0.5927934 1 0.4413410
Heimlich = Table[,,4]
loglm( ~ Sex + Result,
Heimlich)
X^2 df P(> X^2)
Likelihood Ratio 2.158817 1 0.1417538
Pearson 2.136182 1 0.1438595
Summary of analysis from Cochran–Mantel–Haenszel tests and log-linear models
County C-H-M log-linear
p-value p-value (lr)
Bloom 0.0468 0.03381
Cobblestone 0.0102 0.00529
Dougal 0.5230 0.4433
Heimlich 0.1750 0.14175
Logistic regression
Logistic regression is a common option for building models with a nominal dependent variable. For standard logistic regression, the dependent variable must have only two levels. That is, it must be dichotomous.
The following references should be useful for conducting logistic regression.
“Simple Logistic Regression” in Mangiafico, S.S. 2015. An R Companion for the Handbook of Biological Statistics, version 1.09. rcompanion.org/rcompanion/e_06.html.
“Multiple Logistic Regression” in Mangiafico, S.S. 2015. An R Companion for the Handbook of Biological Statistics, version 1.09. rcompanion.org/rcompanion/e_07.html.
“Simple logistic regression” in McDonald, J.H. 2014. Handbook of Biological Statistics. www.biostathandbook.com/simplelogistic.html.
“Multiple logistic regression” in McDonald, J.H. 2014. Handbook of Biological Statistics. www.biostathandbook.com/multiplelogistic.html.
Logistic regression example
Note that model assumptions and pitfalls of logistic regression are not discussed here. The reader is urged to read the preceding references or other appropriate materials before proceeding.
The following example revisits Alexander Anderson pesticide safety training course data across gender and four counties, for which we had used the Cochran–Mantel–Haenszel Test.
Data = read.table(header=TRUE, stringsAsFactors=TRUE, text="
County Gender Result Count
Bloom Female Pass 9
Bloom Female Fail 5
Bloom Male Pass 7
Bloom Male Fail 17
Cobblestone Female Pass 11
Cobblestone Female Fail 4
Cobblestone Male Pass 9
Cobblestone Male Fail 21
Dougal Female Pass 9
Dougal Female Fail 7
Dougal Male Pass 19
Dougal Male Fail 9
Heimlich Female Pass 15
Heimlich Female Fail 8
Heimlich Male Pass 14
Heimlich Male Fail 17
")
### Order factors otherwise R will alphabetize them
Data$County = factor(Data$County,
levels=unique(Data$County))
Data$Gender = factor(Data$Gender,
levels=unique(Data$Gender))
Data$Result = factor(Data$Result,
levels=unique(Data$Result))
### Check the data frame
library(psych)
headTail(Data)
str(Data)
summary(Data)
Logistic regression
Logistic regression can be conducted with the glm function in the native stats package. The summary function will return coefficients for the model.
The Anova function in the car package will produce an analysis of deviance table with either likelihood ratio, Wald, or F tests.
To derive a p-value and a pseudo R-squared value for the model it is recommended to convert the data to long form before fitting the model.
model = glm(Result ~ County + Gender + County:Gender,
weight = Count,
data = Data,
family = binomial(link="logit")
)
summary(model)
library(car)
Anova(model,
type="II",
test="LR")
Analysis of Deviance Table (Type II tests)
LR Chisq Df Pr(>Chisq)
County 4.9627 3 0.174548
Gender 7.8108 1 0.005193 **
County:Gender 7.2169 3 0.065296 .
library(tidyr)
Long = uncount(Data, Count)
str(Long)
'data.frame': 181 obs. of 3 variables:
$ County: Factor w/ 4 levels "Bloom","Cobblestone",..: 1
1 1 1 1 1 1 1 1 1 ...
$ Gender: Factor w/ 2 levels "Female","Male": 1 1 1 1 1 1
1 1 1 1 ...
$ Result: Factor w/ 2 levels "Pass","Fail": 1 1 1 1 1 1 1
1 1 2 ...
model = glm(Result ~ County + Gender + County:Gender,
data = Long,
family = binomial())
library(car)
Anova(model)
Analysis of Deviance Table (Type II tests)
LR Chisq Df Pr(>Chisq)
County 4.9627 3 0.174548
Gender 7.8108 1 0.005193 **
County:Gender 7.2169 3 0.065296 .
library(rcompanion)
nagelkerke(model)
$Pseudo.R.squared.for.model.vs.null
Pseudo.R.squared
McFadden 0.0797857
Cox and Snell (ML) 0.1046550
Nagelkerke (Cragg and Uhler) 0.1395750
$Likelihood.ratio.test
Df.diff LogLik.diff Chisq p.value
-7 -10.004 20.009 0.0055508
library(rcompanion)
efronRSquared(model)
EfronRSquared
0.108
countRSquare(model)
$Result
Count.R2 Count.R2.corrected
1 0.652 0.284
$Confusion.matrix
Predicted
Actual 0 1 Sum
0 63 30 93
1 33 55 88
Sum 96 85 181
Post-hoc analysis
Post-hoc analysis can be conducted with the emmeans package. Here, the type= "response " option is used to report results on a probability scale. For notes on using emmeans with different model types, see ?emmeans::models.
Normally, because the County x Gender interaction was not significant, we wouldn’t want to explore the mean separations of the County x Gender interaction, but here the results are used to compare Gender within each level of County to compare to the other methods in this chapter.
See ?emmeans::summary for p-value adjustment options in emmeans.
library(multcompView)
library(emmeans)
marginal = emmeans(model, ~ Gender|County, type="response")
marginal
pairs(marginal)
County = Bloom:
contrast odds.ratio SE df null z.ratio p.value
Female / Male 0.229 0.164 Inf 1 -2.060 0.0394
County = Cobblestone:
contrast odds.ratio SE df null z.ratio p.value
Female / Male 0.156 0.110 Inf 1 -2.630 0.0085
County = Dougal:
contrast odds.ratio SE df null z.ratio p.value
Female / Male 1.642 1.061 Inf 1 0.767 0.4429
County = Heimlich:
contrast odds.ratio SE df null z.ratio p.value
Female / Male 0.439 0.249 Inf 1 -1.450 0.1470
Tests are performed on the log odds ratio scale
cld(marginal, Letters=letters)
Summary table of results
Comparison p-value
Bloom–Female - Bloom–Male 0.039
Cobblestone–Female - Cobblestone–Male 0.0085
Dougal–Female - Dougal–Male 0.44
Heimlich–Female - Heimlich–Male 0.15
Multinomial logistic regression
If the nominal dependent variable has more than two levels, multinomial logistic regression can be used.
VGAM package
The VGAM package provides a flexible framework for building models with categorical data. The vignette for the package provides an overview of the package, as well as brief overview of categorical analysis in R.
Yee, T.W. The VGAM package for Categorical Data Analysis. The Comprehensive R Archive Network.
The following article may also be helpful for those interested in using the VGAM package.
Yee, T.W. 2008. The VGAM Package. R News, 8(2), 28–39. URL. cran.r-project.org/doc/Rnews/Rnews_2008-2.pdf.
The VGAM package can be explored through the help files for the package.
if(!require(VGAM)){install.packages("VGAM")}
help(package="VGAM")
And multinomial regression is discussed specifically in the entry for multinomial.
library(VGAM)
?multinomial
mlogit and nnet packages
Two other packages that can perform multinomial regression are mlogit and nnet. The packages can be explored with the following vignette, article, and the package help files.
if(!require(mlogit)){install.packages("mlogit")}
help(package="mlogit")
if(!require(nnet)){install.packages("nnet")}
help(package="nnet")
Multinomial regression example
The following example revisits Alexander Anderson’s data of passing grades by sex within counties, for which we had used the Cochran–Mantel–Haenszel Test.
Because the dependent variable, Result, has only two levels, it could be modeled with standard binomial regression. However, a multinomial approach with VGAM will be used here as an example. Note that in multinomial regression, a reference level for the dependent variable must be specified. Coefficients of the model will change if a different reference level is specified.
Note that model assumptions and pitfalls of this approach are not discussed here. The reader is urged to understand the assumptions of this kind of modeling before proceeding.
The summary function will provide estimates of coefficients and standard errors for the coefficients. A p-value for the overall model as well as pseudo R-squared value is provided by the nagelkerke function.
The lrtest function in VGAM can be used to compare two nested models with likelihood ratio test. If only one model is given, it provides a p-value for the overall model, compared with a null model. The Anova function in the car package will provide chi-square tests for individual factors in the model. The user might prefer to use likelihood ratio tests with nested models instead of chi-square tests. This could be accomplished with the lrtest function in the lmtest package, or by specifying the second model with the null= option in the nagelkerke function.
Input = ("
County Sex Result Count
Bloom Female Pass 9
Bloom Female Fail 5
Bloom Male Pass 7
Bloom Male Fail 17
Cobblestone Female Pass 11
Cobblestone Female Fail 4
Cobblestone Male Pass 9
Cobblestone Male Fail 21
Dougal Female Pass 9
Dougal Female Fail 7
Dougal Male Pass 19
Dougal Male Fail 9
Heimlich Female Pass 15
Heimlich Female Fail 8
Heimlich Male Pass 14
Heimlich Male Fail 17
")
Data = read.table(textConnection(Input),header=TRUE)
### Order factors otherwise R will alphabetize them
Data$County = factor(Data$County,
levels=unique(Data$County))
Data$Sex = factor(Data$Sex,
levels=unique(Data$Sex))
Data$Result = factor(Data$Result,
levels=unique(Data$Result))
### Check the data frame
library(psych)
headTail(Data)
str(Data)
summary(Data)
### Remove unnecessary objects
rm(Input)
Multinomial regression
library(VGAM)
model = vglm(Result ~ Sex + County + Sex:County,
family=multinomial(refLevel=1),
weights = Count,
data = Data)
summary(model)
library(car)
Anova(model,
type="II",
test="Chisq")
Analysis of Deviance Table (Type II tests)
Response: Result
Df Chisq Pr(>Chisq)
Sex 1 6.7132 0.00957 **
County 3 4.1947 0.24120
Sex:County 3 7.1376 0.06764 .
library(rcompanion)
nagelkerke(model)
$Pseudo.R.squared.for.model.vs.null
Pseudo.R.squared
McFadden 0.0797857
Cox and Snell (ML) 0.7136520
Nagelkerke (Cragg and Uhler) 0.7136520
$Likelihood.ratio.test
Df.diff LogLik.diff Chisq p.value
7 -10.004 20.009 0.0055508
library(lmtest)
lrtest(model)
Likelihood ratio test
Model 1: Result ~ Sex + County + Sex:County
Model 2: Result ~ 1
#Df LogLik Df Chisq Pr(>Chisq)
1 8 -115.39
2 15 -125.39 7 20.009 0.005551 **
Post-hoc analysis
At the time of writing, the lsmeans package cannot be used with vglm models.
One option for post-hoc analysis would be to conduct analyses on reduced models, including only two levels of a factor. For example, if the variable County x Sex term had been significant, the following code could be used to create a reduced dataset with only Bloom–Female and Bloom–Male, and analyze this data with vglm.
Data.b = Data[Data$County=="Bloom" &
(Data$Sex=="Female"| Data$Sex=="Male")
, ]
Data.b$County = factor(Data.b$County)
Data.b$Sex = factor(Data.b$Sex)
summary(Data.b)
County Sex Result Count
Bloom:4 Female:2 Pass:2 Min. : 5.0
Male :2 Fail:2 1st Qu.: 6.5
Median : 8.0
Mean : 9.5
3rd Qu.:11.0
Max. :17.0
library(VGAM)
model.b = vglm(Result ~ Sex,
family=multinomial(refLevel=1),
weights = Count,
data = Data.b)
lrtest(model.b)
Likelihood ratio test
#Df LogLik Df Chisq Pr(>Chisq)
1 2 -23.612
2 3 -25.864 1 4.5041 0.03381 *
Summary table of results
Comparison p-value
Bloom–Female - Bloom–Male 0.034
Cobblestone–Female - Cobblestone–Male 0.0052
Dougal–Female - Dougal–Male 0.44
Heimlich–Female - Heimlich–Male 0.14
p.value = c(0.034, 0.0052, 0.44, 0.14)
p.adj = p.adjust(p.value,
method = "fdr")
p.adj = signif(p.adj,
2)
p.adj
[1] 0.068 0.021 0.440 0.190
Comparison p-value p.adj
Bloom–Female - Bloom–Male 0.034 0.068
Cobblestone–Female - Cobblestone–Male 0.0052 0.021
Dougal–Female - Dougal–Male 0.44 0.44
Heimlich–Female - Heimlich–Male 0.14 0.19
Mixed-effects logistic regression example
A mixed-effects generalized linear model, as in the case of logistic regression with random effects, can be specified. This example revisits Hayley Smith’s friendly lawn care course, for which we had used Cochran’s Q test.
This example uses the glmer function in the package mle4, which can fit binomial dependent variables, with the binomial family of models, or other families of models. As far as I know, it will not fit multinomial regression. For more information on families of models, see ?family and ?glm. For more information on glmer, see ?glmer. For a list of methods used to extract information from the model object, see methods(class="merMod").
The Anova function in the car package will provide chi-square tests for individual factors in the model. The user might prefer to use likelihood ratio tests with nested models instead of chi-square tests. This could be accomplished with the anova function, or by specifying the second model with the null= option in the nagelkerke function.
Input = ("
Practice Student Response
SoilTest a Yes
SoilTest b No
SoilTest c Yes
SoilTest d Yes
SoilTest e No
SoilTest f Yes
SoilTest g Yes
SoilTest h Yes
SoilTest i No
SoilTest j Yes
SoilTest k Yes
SoilTest l Yes
SoilTest m Yes
SoilTest n Yes
Irrigation a Yes
Irrigation b No
Irrigation c No
Irrigation d No
Irrigation e No
Irrigation f No
Irrigation g No
Irrigation h No
Irrigation i Yes
Irrigation j No
Irrigation k No
Irrigation l No
Irrigation m No
Irrigation n No
MowHeight a Yes
MowHeight b No
MowHeight c Yes
MowHeight d Yes
MowHeight e Yes
MowHeight f Yes
MowHeight g Yes
MowHeight h Yes
MowHeight i No
MowHeight j Yes
MowHeight k Yes
MowHeight l Yes
MowHeight m Yes
MowHeight n Yes
Clippings a Yes
Clippings b No
Clippings c No
Clippings d Yes
Clippings e Yes
Clippings f No
Clippings g No
Clippings h Yes
Clippings i Yes
Clippings j Yes
Clippings k Yes
Clippings l No
Clippings m Yes
Clippings n No
")
Data = read.table(textConnection(Input),header=TRUE)
### Order factors otherwise R will alphabetize them
Data$Practice = factor(Data$Practice,
levels=unique(Data$Practice))
Data$Response = factor(Data$Response,
levels=c("Yes", "No"))
### Check the data frame
library(psych)
headTail(Data)
str(Data)
summary(Data)
### Remove unnecessary objects
rm(Input)
Mixed-effects logistic regression
library(lme4)
model = glmer(Response ~ Practice + (1 | Student),
data = Data,
family = binomial,
nAGQ = 1) ### Must be 1 to
compare with glm model objects
summary(model)
library(car)
Anova(model,
type="II",
test="Chisq")
Analysis of Deviance Table (Type II Wald chisquare tests)
Chisq Df Pr(>Chisq)
Practice 10.713 3 0.01338 *
### p-value for model and pseudo R-squared
model.null = glm(Response ~ 1,
data = Data,
family = binomial)
library(rcompanion)
nagelkerke(model,
model.null)
$Pseudo.R.squared.for.model.vs.null
Pseudo.R.squared
McFadden 0.258315
Cox and Snell (ML) 0.295184
Nagelkerke (Cragg and Uhler) 0.397900
$Likelihood.ratio.test
Df.diff LogLik.diff Chisq p.value
-4 -9.7949 19.59 0.00060164
### Test the random effects
model.fixed = glm(Response ~ Practice,
data = Data,
family = binomial)
anova(model, model.fixed)
model.fixed: Response ~ Practice
model: Response ~ Practice + (1 | Student)
Df AIC BIC logLik deviance Chisq Chi Df Pr(>Chisq)
model.fixed 4 64.636 72.738 -28.318 56.636
model 5 66.247 76.374 -28.124 56.247 0.3889 1 0.5329
Post-hoc analysis
Post-hoc analysis can be conducted with the lsmeans package. Note that estimates for lsmeans, differences, or standard errors are on a log or logit scale. For notes on using lsmeans with different model types, see ?lsmeans::models.
library(multcompView)
library(lsmeans)
marginal = lsmeans(model,
pairwise ~ Practice,
adjust="tukey")
### Tukey adjustment
for multiple comparisons
marginal
$contrasts
contrast estimate SE df z.ratio p.value
SoilTest - Irrigation -3.4004979 1.1888138 NA -2.860 0.0220
SoilTest - MowHeight 0.5256967 1.0384143 NA 0.506 0.9576
SoilTest - Clippings -1.1165227 0.9087443 NA -1.229 0.6086
Irrigation - MowHeight 3.9261946 1.2853129 NA 3.055 0.0121
Irrigation - Clippings 2.2839752 1.0431115 NA 2.190 0.1261
MowHeight - Clippings -1.6422194 1.0071891 NA -1.630 0.3614
Results are given on the log (not the response) scale.
P value adjustment: tukey method for comparing a family of 4 estimates
Tests are performed on the log scale
cld(marginal,
alpha=0.05,
Letters=letters, ### Use lower-case letters
for .group
adjust="tukey") ### Tukey
adjustment for multiple comparisons
MowHeight -1.9640077 0.8712261 NA -4.1341580 0.2061426 a
SoilTest -1.4383110 0.7498776 NA -3.3061924 0.4295705 a
Clippings -0.3217883 0.6055520 NA -1.8301670 1.1865905 ab
Irrigation 1.9621870 0.8672121 NA -0.1979648 4.1223388 b
Results are given on the logit (not the response) scale.
Confidence level used: 0.95
P value adjustment: tukey method for comparing a family of 4 estimates
Tests are performed on the log scale
significance level used: alpha = 0.05
Other tools for categorical analysis
The vcd package has several function that are useful when working with categorical data. The package can be explored with the following vignette and the package help files.
Friendly, M. Working with categorical data with R and the vcd and vcdExtra packages. The Comprehensive R Archive Network.
if(!require(vcd)){install.packages("vcd")}
help(package="vcd")