[banner]

Summary and Analysis of Extension Program Evaluation in R

Salvatore S. Mangiafico

Models for Nominal Data

Packages used in this chapter

 

The packages used in this chapter include:

•  car

•  multcompView

•  lsmeans

•  VGAM

•  lmtest

•  psych

•  lme4

•  MASS

•  rcompanion

 

The following commands will install these packages if they are not already installed:


if(!require(car)){install.packages("car")}
if(!require(multcompView)){install.packages("multcompView")}
if(!require(lsmeans)){install.packages("lsmeans")}
if(!require(VGAM)){install.packages("VGAM")}
if(!require(lmtest)){install.packages("lmtest")}
if(!require(psych)){install.packages("psych")}
if(!require(lme4)){install.packages("lme4")}
if(!require(MASS)){install.packages("MASS")}
if(!require(rcompanion)){install.packages("rcompanion")}


Log-linear models for tests of association

 

Analysis of contingency tables with two or more dimensions can be accomplished with log-linear models.  These will test for association. When there are multiple dimensions, this approach is sometimes called multiway frequency analysis.  It is analogous to chi-square tests for two-way tables or Cochran–Mantel–Haenszel test for three way tables.

 

One advantage of log-linear models is that they handle tables with a higher number of dimensions.

 

Another advantage is that the models can be specified to test for more complicated patterns of association.  When each term in the model is included, and there are no interactions, the model tests for mutual independence of the terms.  Models can also test for conditional independence and partial independence.

 

For more information on specifying models for conditional independence or partial independence, see:

 

Quick-R. 2014. Frequencies and Crosstabs. www.statmethods.net/stats/frequencies.html.

 

Agresti, A. 2007. An Introduction to Categorical Data Analysis 2nd Edition.  Wiley-Interscience.

 

For a more complete discussion of log-linear models for multiway frequency analysis see:

 

Tabachnick, B.G. and S.F. Fidell. 2001. Using multivariate statistics. 4th Edition. Allyn and Bacon, Boston, MA.

 

Zero frequencies in cells may cause the maximum likelihood estimation to fail, or may cause bias in the results.

 

Structural zeros can be handled with the start argument in the loglm function.  See library(MASS); ?loglm , library(MASS); ?loglm1 , and ?loglin .

 

Log-linear model example

The following example revisits Alexander Anderson’s data of passing grades by sex within counties, for which we had used the Cochran–Mantel–Haenszel test.  Here we use a log-linear model with the loglm function in the MASS package.


Input = ("
County       Sex     Result  Count
Bloom        Female  Pass     9
Bloom        Female  Fail     5
Bloom        Male    Pass     7
Bloom        Male    Fail    17
Cobblestone  Female  Pass    11
Cobblestone  Female  Fail     4
Cobblestone  Male    Pass     9
Cobblestone  Male    Fail    21
Dougal       Female  Pass     9
Dougal       Female  Fail     7
Dougal       Male    Pass    19
Dougal       Male    Fail     9
Heimlich     Female  Pass    15
Heimlich     Female  Fail     8
Heimlich     Male    Pass    14
Heimlich     Male    Fail    17
")

Data = read.table(textConnection(Input),header=TRUE)


### Order factors otherwise R will alphabetize them

Data$County = factor(Data$County,
                     levels=unique(Data$County))

Data$Sex    = factor(Data$Sex,
                     levels=unique(Data$Sex))

Data$Result = factor(Data$Result,
                     levels=unique(Data$Result))


###  Check the data frame

library(psych)

headTail(Data)

str(Data)

summary(Data)


### Remove unnecessary objects

rm(Input)


Create table


Table = xtabs(Count ~ Sex + Result + County,
              data=Data)

ftable(Table)                     # Display a flattened table


              County Bloom Cobblestone Dougal Heimlich
Sex    Result                                        
Female Pass              9          11      9       15
       Fail              5           4      7        8
Male   Pass              7           9     19       14
       Fail             17          21      9       17


Log-linear model

The model here tests for mutual independence of the three variables.


library(MASS)

loglm( ~ Sex + Result + County,
       Table)


Statistics:
                      X^2 df  P(> X^2)
Likelihood Ratio 20.96662 10 0.0213275
Pearson          20.79050 10 0.0226027


Post-hoc analysis

For a post-hoc analysis, you could slice up a multi-dimensional table in any way that makes sense for the hypotheses you want to test.  Here, we’ll look for an association of Sex and Result within each county.

 

Third dimension in our table is County, so Table[,,1] yields the piece of the table where the third dimension is equal to 1, that is, where County = Bloom.  And so on.


Bloom = Table[,,1]

loglm( ~ Sex + Result,
       Bloom)


                      X^2 df   P(> X^2)
Likelihood Ratio 4.504069  1 0.03381429
Pearson          4.473688  1 0.03442062


Cobblestone = Table[,,2]

loglm( ~ Sex + Result,
       Cobblestone)


                      X^2 df    P(> X^2)
Likelihood Ratio 7.777229  1 0.005290890
Pearson          7.605000  1 0.005820666


Dougal = Table[,,3]

loglm( ~ Sex + Result,
       Dougal)


                       X^2 df  P(> X^2)
Likelihood Ratio 0.5876125  1 0.4433438
Pearson          0.5927934  1 0.4413410


Heimlich = Table[,,4]

loglm( ~ Sex + Result,
       Heimlich)


                     X^2 df  P(> X^2)
Likelihood Ratio 2.158817  1 0.1417538
Pearson          2.136182  1 0.1438595


Summary of analysis from Cochran–Mantel–Haenszel tests and log-linear models


County        C-H-M     log-linear
              p-value   p-value (lr)
Bloom         0.0468    0.03381
Cobblestone   0.0102    0.00529
Dougal        0.5230    0.4433
Heimlich      0.1750    0.14175


Logistic regression

 

Logistic regression is a common option for building models with a nominal dependent variable.  For standard logistic regression, the dependent variable must have only two levels.  That is, it must be dichotomous.

 

The following references should be useful for conducting logistic regression.

 

“Simple Logistic Regression” in Mangiafico, S.S. 2015. An R Companion for the Handbook of Biological Statistics, version 1.09. rcompanion.org/rcompanion/e_06.html.

 

“Multiple Logistic Regression” in Mangiafico, S.S. 2015. An R Companion for the Handbook of Biological Statistics, version 1.09. rcompanion.org/rcompanion/e_07.html.

 

“Simple logistic regression” in McDonald, J.H. 2014. Handbook of Biological Statistics. www.biostathandbook.com/simplelogistic.html.

 

“Multiple logistic regression” in McDonald, J.H. 2014. Handbook of Biological Statistics. www.biostathandbook.com/multiplelogistic.html.

 

 

Logistic regression example

Note that model assumptions and pitfalls of logistic regression are not discussed here.  The reader is urged to read the preceding references or other appropriate materials before proceeding.

 

The following example revisits Alexander Anderson pesticide safety training course data across gender and four counties, for which we had used the Cochran–Mantel–Haenszel Test.


Data = read.table(header=TRUE, stringsAsFactors=TRUE, text="

County       Gender  Result  Count
Bloom        Female  Pass     9
Bloom        Female  Fail     5
Bloom        Male    Pass     7
Bloom        Male    Fail    17
Cobblestone  Female  Pass    11
Cobblestone  Female  Fail    4
Cobblestone  Male    Pass    9
Cobblestone  Male    Fail    21
Dougal       Female  Pass     9
Dougal       Female  Fail     7
Dougal       Male    Pass    19
Dougal       Male    Fail     9
Heimlich     Female  Pass    15
Heimlich     Female  Fail     8
Heimlich     Male    Pass    14
Heimlich     Male    Fail    17
")


### Order factors otherwise R will alphabetize them

Data$County = factor(Data$County,
                     levels=unique(Data$County))

Data$Gender = factor(Data$Gender,
                     levels=unique(Data$Gender))

Data$Result = factor(Data$Result,
                     levels=unique(Data$Result))


###  Check the data frame


library(psych)

headTail(Data)

str(Data)

summary(Data)


Logistic regression

Logistic regression can be conducted with the glm function in the native stats package.  The summary function will return coefficients for the model.

 

The Anova function in the car package will produce an analysis of deviance table with either likelihood ratio, Wald, or F tests.

 

To derive a p-value and a pseudo R-squared value for the model it is recommended to convert the data to long form before fitting the model. 


model = glm(Result ~ County + Gender + County:Gender,
            weight = Count,
            data = Data,
            family = binomial(link="logit")
            )

summary(model)


library(car)

Anova(model,
      type="II",
      test="LR")


Analysis of Deviance Table (Type II tests)

            LR Chisq Df Pr(>Chisq)  
County        4.9627  3   0.174548  
Gender        7.8108  1   0.005193 **
County:Gender 7.2169  3   0.065296 .


library(tidyr)

Long = uncount(Data, Count)

str(Long)


'data.frame':     181 obs. of  3 variables:
  $ County: Factor w/ 4 levels "Bloom","Cobblestone",..: 1 1 1 1 1 1 1 1 1 1 ...
  $ Gender: Factor w/ 2 levels "Female","Male": 1 1 1 1 1 1 1 1 1 1 ...
  $ Result: Factor w/ 2 levels "Pass","Fail": 1 1 1 1 1 1 1 1 1 2 ...


model = glm(Result ~ County + Gender + County:Gender,
            data = Long,
            family = binomial())

library(car)

Anova(model)


Analysis of Deviance Table (Type II tests)

              LR Chisq Df Pr(>Chisq)  
County          4.9627  3   0.174548  
Gender          7.8108  1   0.005193 **
County:Gender   7.2169  3   0.065296 .


library(rcompanion)

nagelkerke(model)


$Pseudo.R.squared.for.model.vs.null
                             Pseudo.R.squared
McFadden                            0.0797857
Cox and Snell (ML)                  0.1046550
Nagelkerke (Cragg and Uhler)        0.1395750

$Likelihood.ratio.test
Df.diff LogLik.diff  Chisq   p.value
     -7     -10.004 20.009 0.0055508


library(rcompanion)

efronRSquared(model)


EfronRSquared
0.108


countRSquare(model)


$Result
  Count.R2  Count.R2.corrected
1    0.652               0.284


$Confusion.matrix
     Predicted
Actual   0   1 Sum
   0    63  30  93
   1    33  55  88
   Sum  96  85 181


Post-hoc analysis

Post-hoc analysis can be conducted with the emmeans package.  Here, the type= "response " option is used to report results on a probability scale.  For notes on using emmeans with different model types, see ?emmeans::models.

 

Normally, because the County x Gender interaction was not significant, we wouldn’t want to explore the mean separations of the County x Gender interaction, but here the results are used to compare Gender within each level of County to compare to the other methods in this chapter.

 

See ?emmeans::summary for p-value adjustment options in emmeans.


library(multcompView)

library(emmeans)

marginal = emmeans(model, ~ Gender|County, type="response")

marginal

pairs(marginal)     


County = Bloom:
 contrast      odds.ratio    SE  df null z.ratio p.value
 Female / Male      0.229 0.164 Inf    1  -2.060  0.0394

County = Cobblestone:
 contrast      odds.ratio    SE  df null z.ratio p.value
 Female / Male      0.156 0.110 Inf    1  -2.630  0.0085

County = Dougal:
 contrast      odds.ratio    SE  df null z.ratio p.value
 Female / Male      1.642 1.061 Inf    1   0.767  0.4429

County = Heimlich:
 contrast      odds.ratio    SE  df null z.ratio p.value
 Female / Male      0.439 0.249 Inf    1  -1.450  0.1470

Tests are performed on the log odds ratio scale



cld(marginal, Letters=letters)


Summary table of results


Comparison                                p-value
Bloom–Female       - Bloom–Male           0.039
Cobblestone–Female - Cobblestone–Male     0.0085
Dougal–Female      - Dougal–Male          0.44
Heimlich–Female    - Heimlich–Male        0.15

 

Multinomial logistic regression

 

If the nominal dependent variable has more than two levels, multinomial logistic regression can be used. 

 

VGAM package

The VGAM package provides a flexible framework for building models with categorical data.  The vignette for the package provides an overview of the package, as well as brief overview of categorical analysis in R.

 

Yee, T.W.  The VGAM package for Categorical Data Analysis.  The Comprehensive R Archive Network.

 

The following article may also be helpful for those interested in using the VGAM package.

 

Yee, T.W. 2008. The VGAM Package. R News, 8(2), 28–39. URL. cran.r-project.org/doc/Rnews/Rnews_2008-2.pdf.

 

The VGAM package can be explored through the help files for the package.


if(!require(VGAM)){install.packages("VGAM")}

help(package="VGAM")

And multinomial regression is discussed specifically in the entry for multinomial.


library(VGAM)

?multinomial


mlogit and nnet packages

 

Two other packages that can perform multinomial regression are mlogit and nnet.  The packages can be explored with the following vignette, article, and the package help files.


if(!require(mlogit)){install.packages("mlogit")}
help(package="mlogit")


if(!require(nnet)){install.packages("nnet")}
help(package="nnet")


Multinomial regression example

The following example revisits Alexander Anderson’s data of passing grades by sex within counties, for which we had used the Cochran–Mantel–Haenszel Test.

 

Because the dependent variable, Result, has only two levels, it could be modeled with standard binomial regression.  However, a multinomial approach with VGAM will be used here as an example.  Note that in multinomial regression, a reference level for the dependent variable must be specified.  Coefficients of the model will change if a different reference level is specified.

 

Note that model assumptions and pitfalls of this approach are not discussed here.  The reader is urged to understand the assumptions of this kind of modeling before proceeding.

 

The summary function will provide estimates of coefficients and standard errors for the coefficients.  A p-value for the overall model as well as pseudo R-squared value is provided by the nagelkerke function.

 

The lrtest function in VGAM can be used to compare two nested models with likelihood ratio test.  If only one model is given, it provides a p-value for the overall model, compared with a null model.  The Anova function in the car package will provide chi-square tests for individual factors in the model.  The user might prefer to use likelihood ratio tests with nested models instead of chi-square tests.  This could be accomplished with the lrtest function in the lmtest package, or by specifying the second model with the null= option in the nagelkerke function.


Input = ("
County       Sex     Result  Count
Bloom        Female  Pass     9
Bloom        Female  Fail     5
Bloom        Male    Pass     7
Bloom        Male    Fail    17
Cobblestone  Female  Pass    11
Cobblestone  Female  Fail     4
Cobblestone  Male    Pass     9
Cobblestone  Male    Fail    21
Dougal       Female  Pass     9
Dougal       Female  Fail     7
Dougal       Male    Pass    19
Dougal       Male    Fail     9
Heimlich     Female  Pass    15
Heimlich     Female  Fail     8
Heimlich     Male    Pass    14
Heimlich     Male    Fail    17
")

Data = read.table(textConnection(Input),header=TRUE)


### Order factors otherwise R will alphabetize them

Data$County = factor(Data$County,
                     levels=unique(Data$County))

Data$Sex    = factor(Data$Sex,
                     levels=unique(Data$Sex))

Data$Result = factor(Data$Result,
                     levels=unique(Data$Result))


###  Check the data frame

library(psych)

headTail(Data)

str(Data)

summary(Data)


### Remove unnecessary objects

rm(Input)


Multinomial regression


library(VGAM)

model = vglm(Result ~ Sex + County + Sex:County,
             family=multinomial(refLevel=1),
             weights = Count,
             data = Data)


summary(model)


library(car)

Anova(model,
      type="II",
      test="Chisq")


Analysis of Deviance Table (Type II tests)

Response: Result
           Df  Chisq Pr(>Chisq)  
Sex         1 6.7132    0.00957 **
County      3 4.1947    0.24120  
Sex:County  3 7.1376    0.06764 .


library(rcompanion)

nagelkerke(model)


$Pseudo.R.squared.for.model.vs.null
                             Pseudo.R.squared
McFadden                            0.0797857
Cox and Snell (ML)                  0.7136520
Nagelkerke (Cragg and Uhler)        0.7136520

$Likelihood.ratio.test
 Df.diff LogLik.diff  Chisq   p.value
       7     -10.004 20.009 0.0055508


library(lmtest)

lrtest(model)


Likelihood ratio test

Model 1: Result ~ Sex + County + Sex:County
Model 2: Result ~ 1

  #Df  LogLik Df  Chisq Pr(>Chisq)  
1   8 -115.39                       
2  15 -125.39  7 20.009   0.005551 **


Post-hoc analysis

At the time of writing, the lsmeans package cannot be used with vglm models.

 

One option for post-hoc analysis would be to conduct analyses on reduced models, including only two levels of a factor.  For example, if the variable County x Sex term had been significant, the following code could be used to create a reduced dataset with only Bloom–Female and Bloom–Male, and analyze this data with vglm.


Data.b        = Data[Data$County=="Bloom" &
                     (Data$Sex=="Female"| Data$Sex=="Male") , ]

Data.b$County = factor(Data.b$County)
Data.b$Sex    = factor(Data.b$Sex)


summary(Data.b)


   County      Sex     Result      Count    
 Bloom:4   Female:2   Pass:2   Min.   : 5.0 
           Male  :2   Fail:2   1st Qu.: 6.5 
                               Median : 8.0 
                               Mean   : 9.5 
                               3rd Qu.:11.0 
                               Max.   :17.0 


library(VGAM)

model.b = vglm(Result ~ Sex,
                family=multinomial(refLevel=1),
                weights = Count,
                data = Data.b)

lrtest(model.b)


Likelihood ratio test

  #Df  LogLik Df  Chisq Pr(>Chisq) 
1   2 -23.612                      
2   3 -25.864  1 4.5041    0.03381 *


Summary table of results


Comparison                                p-value
Bloom–Female       - Bloom–Male           0.034
Cobblestone–Female - Cobblestone–Male     0.0052
Dougal–Female      - Dougal–Male          0.44
Heimlich–Female    - Heimlich–Male        0.14


p.value = c(0.034, 0.0052, 0.44, 0.14)

p.adj = p.adjust(p.value,
                 method = "fdr")


p.adj = signif(p.adj,
               2)

p.adj


[1] 0.068 0.021 0.440 0.190


Comparison                                p-value  p.adj
Bloom–Female       - Bloom–Male           0.034    0.068
Cobblestone–Female - Cobblestone–Male     0.0052   0.021
Dougal–Female      - Dougal–Male          0.44     0.44
Heimlich–Female    - Heimlich–Male        0.14     0.19


Mixed-effects logistic regression example

 

A mixed-effects generalized linear model, as in the case of logistic regression with random effects, can be specified.  This example revisits Hayley Smith’s friendly lawn care course, for which we had used Cochran’s Q test.

 

This example uses the glmer function in the package mle4, which can fit binomial dependent variables, with the binomial family of models, or other families of models.  As far as I know, it will not fit multinomial regression.  For more information on families of models, see ?family and ?glm.  For more information on glmer, see ?glmer.  For a list of methods used to extract information from the model object, see methods(class="merMod").

 

The Anova function in the car package will provide chi-square tests for individual factors in the model.  The user might prefer to use likelihood ratio tests with nested models instead of chi-square tests.  This could be accomplished with the anova function, or by specifying the second model with the null= option in the nagelkerke function.


Input = ("
Practice     Student   Response
SoilTest     a         Yes
SoilTest     b         No
SoilTest     c         Yes
SoilTest     d         Yes
SoilTest     e         No
SoilTest     f         Yes
SoilTest     g         Yes
SoilTest     h         Yes
SoilTest     i         No
SoilTest     j         Yes
SoilTest     k         Yes
SoilTest     l         Yes
SoilTest     m         Yes
SoilTest     n         Yes
Irrigation   a         Yes
Irrigation   b         No
Irrigation   c         No
Irrigation   d         No
Irrigation   e         No
Irrigation   f         No
Irrigation   g         No
Irrigation   h         No
Irrigation   i         Yes
Irrigation   j         No
Irrigation   k         No
Irrigation   l         No
Irrigation   m         No
Irrigation   n         No
MowHeight    a         Yes
MowHeight    b         No
MowHeight    c         Yes
MowHeight    d         Yes
MowHeight    e         Yes
MowHeight    f         Yes
MowHeight    g         Yes
MowHeight    h         Yes
MowHeight    i         No
MowHeight    j         Yes
MowHeight    k         Yes
MowHeight    l         Yes
MowHeight    m         Yes
MowHeight    n         Yes
Clippings    a         Yes
Clippings    b         No
Clippings    c         No
Clippings    d         Yes
Clippings    e         Yes
Clippings    f         No
Clippings    g         No
Clippings    h         Yes
Clippings    i         Yes
Clippings    j         Yes
Clippings    k         Yes
Clippings    l         No
Clippings    m         Yes
Clippings    n         No
")

Data = read.table(textConnection(Input),header=TRUE)


### Order factors otherwise R will alphabetize them

Data$Practice = factor(Data$Practice,
                   levels=unique(Data$Practice))

Data$Response = factor(Data$Response,
                   levels=c("Yes", "No"))


###  Check the data frame

library(psych)

headTail(Data)

str(Data)

summary(Data)


### Remove unnecessary objects

rm(Input)


Mixed-effects logistic regression


library(lme4)

model = glmer(Response ~ Practice + (1 | Student),
              data = Data,
              family = binomial,
              nAGQ = 1)      ### Must be 1 to compare with glm model objects


summary(model)

library(car)

Anova(model,
      type="II",
      test="Chisq")


Analysis of Deviance Table (Type II Wald chisquare tests)

          Chisq Df Pr(>Chisq) 
Practice 10.713  3    0.01338 *


### p-value for model and pseudo R-squared


model.null = glm(Response ~ 1,
              data = Data,
              family = binomial)

library(rcompanion)

nagelkerke(model,
           model.null)


$Pseudo.R.squared.for.model.vs.null
                             Pseudo.R.squared
McFadden                             0.258315
Cox and Snell (ML)                   0.295184
Nagelkerke (Cragg and Uhler)         0.397900

$Likelihood.ratio.test
 Df.diff LogLik.diff Chisq    p.value
      -4     -9.7949 19.59 0.00060164


### Test the random effects

model.fixed = glm(Response ~ Practice,
              data = Data,
              family = binomial)

anova(model, model.fixed)


model.fixed: Response ~ Practice
model: Response ~ Practice + (1 | Student)

            Df    AIC    BIC  logLik deviance  Chisq Chi Df Pr(>Chisq)
model.fixed  4 64.636 72.738 -28.318   56.636                        
model        5 66.247 76.374 -28.124   56.247 0.3889      1     0.5329


Post-hoc analysis

Post-hoc analysis can be conducted with the lsmeans package.  Note that estimates for lsmeans, differences, or standard errors are on a log or logit scale.  For notes on using lsmeans with different model types, see ?lsmeans::models


library(multcompView)

library(lsmeans)

marginal = lsmeans(model,
                      pairwise ~ Practice,
                      adjust="tukey")    
                             ### Tukey adjustment for multiple comparisons

marginal


$contrasts
 contrast                 estimate        SE df z.ratio p.value
 SoilTest - Irrigation  -3.4004979 1.1888138 NA  -2.860  0.0220
 SoilTest - MowHeight    0.5256967 1.0384143 NA   0.506  0.9576
 SoilTest - Clippings   -1.1165227 0.9087443 NA  -1.229  0.6086
 Irrigation - MowHeight  3.9261946 1.2853129 NA   3.055  0.0121
 Irrigation - Clippings  2.2839752 1.0431115 NA   2.190  0.1261
 MowHeight - Clippings  -1.6422194 1.0071891 NA  -1.630  0.3614

Results are given on the log (not the response) scale.
P value adjustment: tukey method for comparing a family of 4 estimates
Tests are performed on the log scale


cld(marginal,
    alpha=0.05,
    Letters=letters,  ### Use lower-case letters for .group
    adjust="tukey")   ### Tukey adjustment for multiple comparisons


 MowHeight  -1.9640077 0.8712261 NA -4.1341580 0.2061426  a   
 SoilTest   -1.4383110 0.7498776 NA -3.3061924 0.4295705  a   
 Clippings  -0.3217883 0.6055520 NA -1.8301670 1.1865905  ab  
 Irrigation  1.9621870 0.8672121 NA -0.1979648 4.1223388   b  

Results are given on the logit (not the response) scale.
Confidence level used: 0.95

P value adjustment: tukey method for comparing a family of 4 estimates
Tests are performed on the log scale

significance level used: alpha = 0.05


Other tools for categorical analysis

 

The vcd package has several function that are useful when working with categorical data.  The package can be explored with the following vignette and the package help files.

 

Friendly, M. Working with categorical data with R and the vcd and vcdExtra packages. The Comprehensive R Archive Network.


if(!require(vcd)){install.packages("vcd")}

help(package="vcd")