
Summary and Analysis of Extension Program Evaluation in R

Salvatore S. Mangiafico

Association Tests for Nominal Variables

These tests for nominal variables are used to determine if two nominal variables are associated.  Sometimes the term “independent” is used to mean that there is no association.

 

In general, there are no assumptions about the distribution of data for these tests.

 

Note that for these tests of association there shouldn’t be paired values.  For example, if experimental units—the things you are counting—are “students before” and “students after”, or “left hands” and “right hands”, the tests in the chapter Tests for Paired Nominal Data may be more appropriate.

 

Also note that these tests will not be accurate if there are “structural zeros” in the contingency table.  If you were counting pregnant and non-pregnant individuals across categories like male and female, the male–pregnant cell may contain a structural zero if you assume your population cannot have pregnant males.

 

Low cell counts

The results of chi-square tests and G-tests can be inaccurate if cell counts are low.  A rule of thumb is that all cell counts should be 5 or greater for chi-square tests and G-tests.  For a more complete discussion of what constitutes low cell counts, see McDonald in the “Optional readings” section.
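As a rough sketch of how this rule of thumb might be checked, the following counts the cells below 5 in both the observed table and the expected counts under the null hypothesis.  It uses the counts from the example later in this chapter, re-entered here so the block runs on its own.  The object names low_observed and low_expected are chosen for illustration only, and whether the rule is applied to observed or expected counts varies among authors.

```r
# Counts from this chapter's pesticide-course example (rows = counties)
Matrix <- matrix(c(21,  5,
                    6, 11,
                    7,  8,
                   27,  5),
                 ncol = 2, byrow = TRUE,
                 dimnames = list(c("Bloom", "Cobblestone", "Dougal", "Heimlich"),
                                 c("Pass", "Fail")))

# Number of observed cell counts below 5
low_observed <- sum(Matrix < 5)

# Expected counts under the null hypothesis of no association.
# chisq.test may warn when expected counts are small, so the
# warning is suppressed while extracting the expected table.
Expected <- suppressWarnings(chisq.test(Matrix)$expected)
low_expected <- sum(Expected < 5)

Expected
low_observed
low_expected
```

If either count is greater than zero, an exact test or a corrected test may be worth considering.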

 

One approach when there are low counts is to use exact tests, such as Fisher’s exact test, whose p-values do not rely on large-sample approximations and so are not affected by low cell counts.

 

Another approach is to apply a continuity correction to the test.  The chisq.test function automatically applies Yates's continuity correction for 2 x 2 tables.  The GTest function has options for Yates or Williams corrections.
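The following is a minimal sketch of how these corrections might be requested.  The 2 x 2 table named Small is hypothetical and made up purely for illustration.  The GTest calls assume the DescTools package is installed and are skipped otherwise.

```r
# A hypothetical 2 x 2 table of counts, for illustration only
Small <- matrix(c(12, 5,
                   6, 9),
                nrow = 2, byrow = TRUE,
                dimnames = list(c("GroupA", "GroupB"),
                                c("Yes", "No")))

# chisq.test uses correct = TRUE by default, which applies
# Yates's continuity correction to 2 x 2 tables
with_yates    <- chisq.test(Small)
without_yates <- chisq.test(Small, correct = FALSE)

with_yates$p.value
without_yates$p.value

# GTest is in the DescTools package; the correct argument
# accepts "none", "williams", or "yates"
if (requireNamespace("DescTools", quietly = TRUE)) {
  print(DescTools::GTest(Small, correct = "williams"))
  print(DescTools::GTest(Small, correct = "yates"))
}
```

The continuity correction shrinks the test statistic, so the corrected p-value is larger than the uncorrected one.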

 

Fisher’s exact test

Technically, Fisher’s exact test assumes that the table has fixed margins in at least one dimension.  That is, if the table has counts of some nominal variable for both males and females, the test assumes that the number of males tested and the number of females tested were determined ahead of time.

 

It appears to me that this assumption is commonly ignored, but I don’t know whether that reflects ignorance of the assumption or a judgment that the advantages of the exact test outweigh the cost of violating it.

 

Choosing among tests

If cell counts are not low, using either the G-test or the chi-square test is fine.  The G-test is probably technically a better test than the chi-square test.  The advantage of chi-square tests is that your audience may be more familiar with them.  That being said, some authors recommend using Fisher’s exact test routinely for tables of fewer than, say, 1000 observations, with the assumption that exact tests yield more accurate p-values.

 

 

Appropriate data

•  Two nominal variables with two or more levels each.

•  Experimental units aren’t paired.

•  There are no structural zeros in the contingency table.

•  G-test and chi-square test may not be appropriate if there are cells with low counts in them.

 

Hypotheses

•  Null hypothesis:  There is no association between the two variables.

•  Alternative hypothesis (two-sided): There is an association between the two variables.

 

Interpretation

Significant results can be reported as “There was a significant association between variable A and variable B.”

 

Post-hoc analysis

Post-hoc analysis for tests on a contingency table larger than 2 x 2 can be conducted by testing the component 2 x 2 tables.  A correction for multiple tests should be applied to the resulting p-values.
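To make the idea concrete, here is a minimal base R sketch that runs Fisher’s exact test on each pair of rows of this chapter’s example table and then adjusts the p-values.  It is only an illustration of the approach, and is not necessarily how the rcompanion functions used later in this chapter are implemented.  The object names pairs, p_raw, and p_adj are chosen for illustration.

```r
# Counts from this chapter's pesticide-course example (rows = counties)
Matrix <- matrix(c(21,  5,
                    6, 11,
                    7,  8,
                   27,  5),
                 ncol = 2, byrow = TRUE,
                 dimnames = list(c("Bloom", "Cobblestone", "Dougal", "Heimlich"),
                                 c("Pass", "Fail")))

# All pairs of row names: a 2 x 6 character matrix, one pair per column
pairs <- combn(rownames(Matrix), 2)

# Fisher's exact test on each component 2 x 2 table
p_raw <- apply(pairs, 2, function(pr) fisher.test(Matrix[pr, ])$p.value)
names(p_raw) <- apply(pairs, 2, paste, collapse = " : ")

# Correction for multiple tests; see ?p.adjust for other methods
p_adj <- p.adjust(p_raw, method = "fdr")

data.frame(Comparison   = names(p_raw),
           p.Fisher     = signif(p_raw, 3),
           p.adj.Fisher = signif(p_adj, 3),
           row.names    = NULL)
```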

 

Other notes and alternative tests

•  For paired data, see the tests in the chapter Tests for Paired Nominal Data.

•  For contingency tables with three dimensions, the Cochran–Mantel–Haenszel test can be used.
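As a minimal sketch of the last point, the mantelhaen.test function in base R conducts the Cochran–Mantel–Haenszel test on a three-dimensional table.  The UCBAdmissions data set that ships with R is used here only for illustration; it is not part of this chapter’s example.

```r
# UCBAdmissions is a built-in 2 x 2 x 6 table: Admit x Gender x Dept
dim(UCBAdmissions)

# Test the association between Admit and Gender,
# stratified by the third dimension (Dept)
CMH <- mantelhaen.test(UCBAdmissions)

CMH
```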

 

Packages used in this chapter

 

The packages used in this chapter include:

•  DescTools

•  multcompView

•  rcompanion

 

The following commands will install these packages if they are not already installed:


if(!require(DescTools)){install.packages("DescTools")}
if(!require(multcompView)){install.packages("multcompView")}
if(!require(rcompanion)){install.packages("rcompanion")}

 

Association tests for nominal variables example

 

Alexander Anderson runs the pesticide safety training course in four counties.  Students must pass in order to obtain their pesticide applicator’s license.  He wishes to see if there is an association between the county in which the course was held and the rate of passing the test.  The following are his data.

 

County               Pass   Fail
Bloom County         21      5
Cobblestone County    6     11
Dougal County         7      8
Heimlich County      27      5


Reading the data as a matrix

 

Input =("
County               Pass   Fail
Bloom                21      5
Cobblestone           6     11
Dougal                7      8
Heimlich             27      5
")

Matrix = as.matrix(read.table(textConnection(Input),
                   header=TRUE,
                   row.names=1))

Matrix
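Before testing, it can help to look at the row totals and the percentage passing in each county.  The following sketch uses base R functions only.  The matrix is re-entered so the block runs on its own, and the object names Totals and Percent are chosen for illustration.

```r
# Same counts as the Matrix object above
Matrix <- matrix(c(21,  5,
                    6, 11,
                    7,  8,
                   27,  5),
                 ncol = 2, byrow = TRUE,
                 dimnames = list(c("Bloom", "Cobblestone", "Dougal", "Heimlich"),
                                 c("Pass", "Fail")))

# Total number of students in each county
Totals <- rowSums(Matrix)

# Percentage passing in each county, rounded to whole percents;
# margin = 1 computes proportions within each row
Percent <- round(100 * prop.table(Matrix, margin = 1)[, "Pass"])

Totals
Percent
```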


Fisher’s exact test of association


fisher.test(Matrix)


Fisher's Exact Test for Count Data

p-value = 0.000668
alternative hypothesis: two.sided


Post-hoc analysis

Post-hoc analysis can be conducted with pairwise Fisher’s exact tests.  My custom function pairwiseNominalIndependence can be used to conduct this analysis.


### Order matrix

Matrix = Matrix[(c("Heimlich", "Bloom", "Dougal", "Cobblestone")),]

Matrix


### Pairwise tests of association

library(rcompanion)

PT = pairwiseNominalIndependence(Matrix,
                                 compare = "row",
                                 fisher  = TRUE,
                                 gtest   = FALSE,
                                 chisq   = FALSE,
                                 method  = "fdr",  # see ?p.adjust for options
                                 digits  = 3)

PT


              Comparison p.Fisher p.adj.Fisher
1       Heimlich : Bloom 0.740000      0.74000
2      Heimlich : Dougal 0.013100      0.02620
3 Heimlich : Cobblestone 0.000994      0.00596
4         Bloom : Dougal 0.037600      0.05640
5    Bloom : Cobblestone 0.003960      0.01190
6   Dougal : Cobblestone 0.720000      0.74000
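The adjusted column can be checked against base R.  As a sketch, passing the unadjusted p-values from the table above to p.adjust with method = "fdr" should reproduce the adjusted values to three significant digits.  The vector names p and adj are chosen for illustration.

```r
# Unadjusted p-values, in the same order as the rows of the table above
p <- c(0.740000, 0.013100, 0.000994, 0.037600, 0.003960, 0.720000)

# Benjamini-Hochberg false discovery rate adjustment
adj <- p.adjust(p, method = "fdr")

signif(adj, 3)
```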



library(rcompanion)

cldList(comparison = PT$Comparison,
        p.value    = PT$p.adj.Fisher,
        threshold  = 0.05)


        Group Letter MonoLetter
1    Heimlich      a        a 
2       Bloom     ab        ab
3      Dougal     bc         bc
4 Cobblestone      c          c


The table of adjusted p-values can be summarized as a table of letters indicating which counties are not significantly different.


County               Percent passing   Letter
Heimlich County      84%               a
Bloom County         81%               ab
Dougal County        47%                bc
Cobblestone County   35%                 c

Counties sharing a letter are not significantly different by Fisher’s exact test, with p-values adjusted by the FDR method for multiple comparisons (Benjamini–Hochberg false discovery rate).


This table of letters can also be found using my pairwiseNominalMatrix function along with the multcompLetters function in the multcompView package.


### Order matrix


Matrix = Matrix[(c("Heimlich", "Bloom", "Dougal", "Cobblestone")),]

Matrix


### Pairwise tests of association

library(rcompanion)

PM = pairwiseNominalMatrix(Matrix,
                           compare = "row",
                           fisher  = TRUE,
                           gtest   = FALSE,
                           chisq   = FALSE,
                           method  = "fdr",  # see ?p.adjust for options
                           digits  = 3)
PM


$Test
[1] "Fisher exact test"

$Method
[1] "fdr"

$Adjusted
            Heimlich  Bloom Dougal Cobblestone
Heimlich     1.00000 0.7400 0.0262     0.00596
Bloom        0.74000 1.0000 0.0564     0.01190
Dougal       0.02620 0.0564 1.0000     0.74000
Cobblestone  0.00596 0.0119 0.7400     1.00000


library(multcompView)  
  
multcompLetters(PM$Adjusted,  
                compare="<",  
                threshold=0.05,  ### p-value to use as significance threshold  
                Letters=letters,  
                reversed = FALSE)


   Heimlich       Bloom      Dougal Cobblestone
        "a"        "ab"        "bc"         "c"


G-test of association


library(DescTools)

GTest(Matrix)


Log likelihood ratio (G-test) test of independence without correction

G = 17.14, X-squared df = 3, p-value = 0.0006615


Post-hoc analysis


### Order matrix

Matrix = Matrix[(c("Heimlich", "Bloom", "Dougal", "Cobblestone")),]

Matrix

### Pairwise tests of association

pairwiseNominalIndependence(Matrix,
                            compare = "row",
                            fisher  = FALSE,
                            gtest   = TRUE,
                            chisq   = FALSE,
                            method  = "fdr",  # see ?p.adjust for options
                            digits  = 3)


              Comparison  p.Gtest p.adj.Gtest
1       Heimlich : Bloom 0.718000     0.71800
2      Heimlich : Dougal 0.008300     0.01660
3 Heimlich : Cobblestone 0.000506     0.00304
4         Bloom : Dougal 0.024800     0.03720
5    Bloom : Cobblestone 0.002380     0.00714
6   Dougal : Cobblestone 0.513000     0.61600



The table of adjusted p-values can be summarized as a table of letters indicating which counties are not significantly different.


County               Percent passing   Letter
Heimlich County      84%               a
Bloom County         81%               a
Dougal County        47%                 b
Cobblestone County   35%                 b

Counties sharing a letter are not significantly different by G-test of association, with p-values adjusted by the FDR method for multiple comparisons (Benjamini–Hochberg false discovery rate).



Chi-square test of association


chisq.test(Matrix)


Pearson's Chi-squared test

X-squared = 17.32, df = 3, p-value = 0.0006072


Post-hoc analysis


### Order matrix

Matrix = Matrix[(c("Heimlich", "Bloom", "Dougal", "Cobblestone")),]

Matrix

### Pairwise tests of association

library(rcompanion)

pairwiseNominalIndependence(Matrix,
                            compare = "row",
                            fisher  = FALSE,
                            gtest   = FALSE,
                            chisq   = TRUE,
                            method  = "fdr",  # see ?p.adjust for options
                            digits  = 3)


              Comparison p.Chisq p.adj.Chisq
1       Heimlich : Bloom 0.99000     0.99000
2      Heimlich : Dougal 0.01910     0.03820
3 Heimlich : Cobblestone 0.00154     0.00924
4         Bloom : Dougal 0.05590     0.08380
5    Bloom : Cobblestone 0.00707     0.02120
6   Dougal : Cobblestone 0.77000     0.92400


The table of adjusted p-values can be summarized as a table of letters indicating which counties are not significantly different.


County               Percent passing   Letter
Heimlich County      84%               a
Bloom County         81%               ab
Dougal County        47%                bc
Cobblestone County   35%                 c

Counties sharing a letter are not significantly different by chi-square test of association, with p-values adjusted by the FDR method for multiple comparisons (Benjamini–Hochberg false discovery rate).

Optional readings


“Small numbers in chi-square and G–tests”
in McDonald, J.H. 2014. Handbook of Biological Statistics. www.biostathandbook.com/small.html.

 


Exercises M


1. Considering Alexander Anderson’s data,

Numerically, which county had the greatest percentage of passing students?

Numerically, which county had the greatest number of total students?

Was there an association between county and student success?

Statistically, which counties performed the best?  Which performed the worst?  Be sure to indicate which post-hoc test you are using.

 

Plot the data in an appropriate way.  What does the plot suggest to you?

 

 

2. Ryuk and Rem held a workshop on planting habitat for pollinators like bees and butterflies.  They wish to know if there is an association between the profession of the attendees and their willingness to undertake a conservation planting.  The following are the data.

 

             Will plant?
Profession   Yes   No
Homeowner    13    14
Landscaper   27     6
Farmer        7    19
NGO           6     6

 

For each of the following, answer the question, and show the output from the analyses you used to answer the question.

 

Numerically, which profession had the greatest percentage of answering yes?

Numerically, which profession had the greatest number of total attendees?

Was there an association between profession and willingness to undertake a conservation planting?

Statistically, which professions were most willing?  Which were least willing?  Be sure to indicate which post-hoc test you are using.

 

Plot the data in an appropriate way.  What does the plot suggest to you?