
These tests for nominal variables are used to determine if two nominal variables are associated. Sometimes the term “independent” is used to mean that there is no association.

In general, there are no assumptions about the distribution of data for these tests.

Note that for these tests of association there shouldn’t be
paired values. For example, if experimental units—the things you are
counting—are “students before” and “students after”, or “left hands” and “right
hands”, the tests in the chapter *Tests for Paired Nominal Data* may be
more appropriate.

Also note that these tests will not be accurate if there are “structural zeros” in the contingency table. If you were counting pregnant and non-pregnant individuals across categories like male and female, the male–pregnant cell may contain a structural zero if you assume your population cannot have pregnant males.

#### Low cell counts

The results of chi-square tests and G-tests can be inaccurate if expected cell counts are low. A rule of thumb is that all expected counts should be 5 or greater for both chi-square tests and G-tests. For a more complete discussion of what constitutes low cell counts, see McDonald in the “Optional Readings” section.

For tables larger than 2 x 2, a rule of thumb is that all expected counts are 1 or greater and that no more than 20% of expected counts are less than 5.
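As a rough illustration, the rules of thumb above can be checked directly from a table of observed counts. The sketch below uses base R only. The helper name `check_expected_counts` is hypothetical (it is not part of any package), and the counts are those of the pesticide-course example used later in this chapter.

```r
# Hypothetical helper (not from any package): checks the rules of thumb
# for expected counts, given a matrix of observed counts.
check_expected_counts = function(observed) {
  # Expected count per cell = row total * column total / grand total
  expected = outer(rowSums(observed), colSums(observed)) / sum(observed)
  list(expected       = expected,
       all.at.least.5 = all(expected >= 5),
       all.at.least.1 = all(expected >= 1),
       prop.below.5   = mean(expected < 5))
}

Observed = matrix(c(21,  5,
                     6, 11,
                     7,  8,
                    27,  5),
                  ncol = 2, byrow = TRUE,
                  dimnames = list(c("Bloom", "Cobblestone", "Dougal", "Heimlich"),
                                  c("Pass", "Fail")))

Check = check_expected_counts(Observed)

Check$all.at.least.1         # is every expected count 1 or greater?
Check$prop.below.5 <= 0.20   # are no more than 20% of expected counts below 5?
```

Returning the expected counts alongside the logical checks makes it easy to see which cells are responsible when a rule fails.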

One approach when there are low expected counts is to use an exact test, such as Fisher’s exact test, which does not rely on a large-sample approximation and so is not affected by low cell counts.

Another approach is to apply a continuity correction to the
test. The *chisq.test* function automatically applies Yates's
continuity correction for 2 x 2 tables. The *GTest* function has options
for Yates or Williams corrections.
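The effect of the continuity correction can be seen by running *chisq.test* on a 2 x 2 table with and without it. This is a minimal sketch. The table is assumed for illustration and uses the counts for two of the counties from the example later in this chapter. `suppressWarnings` is used only to silence the low-expected-count warning that R issues for this small table.

```r
# 2 x 2 table: pass/fail counts for two counties
Table = matrix(c(21, 5,
                  7, 8),
               nrow = 2, byrow = TRUE,
               dimnames = list(c("Bloom", "Dougal"),
                               c("Pass", "Fail")))

# Default behavior: Yates's continuity correction is applied for 2 x 2 tables
Corrected = suppressWarnings(chisq.test(Table))

# Same test with the correction turned off
Uncorrected = suppressWarnings(chisq.test(Table, correct = FALSE))

Corrected$p.value
Uncorrected$p.value
```

The corrected test gives the smaller test statistic and the larger *p*-value, so it is the more conservative of the two.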

#### Fisher’s exact test

Technically, Fisher’s exact test assumes that the table has fixed margins in at least one dimension. As one example, if the table has counts of males and females, the test could assume that the number of males tested and the number of females tested was determined ahead of time.

It appears to me that this assumption is commonly ignored, but I don’t know if this is because of ignorance of this assumption, or just that some people assume that advantages of the exact test outweigh any violation of the assumption.

#### Choosing among tests

If there are not low expected counts, using a G-test or chi-square test is fine. The advantage of chi-square tests is that your audience may be more familiar with them. That being said, some authors recommend using Fisher’s exact test routinely unless the number of observations is so great that the analysis takes too long.

##### Appropriate data

• Two nominal variables with two or more levels each.

• Experimental units aren’t paired.

• There are no structural zeros in the contingency table.

• G-test and chi-square test may not be appropriate if there are cells with low expected counts.

##### Hypotheses

• Null hypothesis: There is no association between the two variables.

• Alternative hypothesis (two-sided): There is an association between the two variables.

##### Interpretation

Significant results can be reported as “There was a significant association between variable A and variable B.”

##### Post-hoc analysis

Post-hoc analysis for tests on a contingency table larger than 2 x 2 can be conducted by running tests on the component 2 x n tables. A correction for multiple tests should be applied.

##### Other notes and alternative tests

• For paired data, see the tests in the chapter *Tests for
Paired Nominal Data*.

• For tables with 3 dimensions, the Cochran–Mantel–Haenszel test can be used.
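For the three-dimensional case, base R provides *mantelhaen.test*. The sketch below is illustrative only. It uses `UCBAdmissions`, a 2 x 2 x 6 table that ships with R, rather than data from this chapter.

```r
# UCBAdmissions is a built-in 2 x 2 x 6 table:
# admission status by gender, stratified by department.
data(UCBAdmissions)

dim(UCBAdmissions)

# Cochran-Mantel-Haenszel test for association between the first two
# dimensions (Admit and Gender), controlling for the third (Dept)
CMH = mantelhaen.test(UCBAdmissions)

CMH
```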

### Packages used in this chapter

The packages used in this chapter include:

• DescTools

• multcompView

• rcompanion

The following commands will install these packages if they are not already installed:

if(!require(DescTools)){install.packages("DescTools")}

if(!require(multcompView)){install.packages("multcompView")}

if(!require(rcompanion)){install.packages("rcompanion")}

### Association tests for nominal variables example

Alexander Anderson runs the pesticide safety training course in four counties. Students must pass in order to obtain their pesticide applicator’s license. He wishes to see if there is an association between the county in which the course was held and the rate of passing the test. The following are his data.

| County | Pass | Fail |
|---|---|---|
| Bloom County | 21 | 5 |
| Cobblestone County | 6 | 11 |
| Dougal County | 7 | 8 |
| Heimlich County | 27 | 5 |

#### Reading the data as a matrix

Input =("

County Pass Fail

Bloom 21 5

Cobblestone 6 11

Dougal 7 8

Heimlich 27 5

")

Matrix = as.matrix(read.table(textConnection(Input),

header=TRUE,

row.names=1))

Matrix
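Once the counts are in a matrix, the pass rate for each county can be computed with *prop.table*. This is a small sketch in base R. So that it runs on its own, it rebuilds the same counts with *matrix* rather than *read.table*, and the object names `Prop` and `Percent.pass` are chosen here for illustration.

```r
# Same counts as above, entered directly as a matrix
Matrix = matrix(c(21,  5,
                   6, 11,
                   7,  8,
                  27,  5),
                ncol = 2, byrow = TRUE,
                dimnames = list(c("Bloom", "Cobblestone", "Dougal", "Heimlich"),
                                c("Pass", "Fail")))

# Row proportions: each row sums to 1
Prop = prop.table(Matrix, margin = 1)

# Percent passing in each county, rounded to a whole percent
Percent.pass = round(100 * Prop[, "Pass"])

Percent.pass
```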

#### Expected cell counts

The *chisq.test* function can be used to identify the
expected counts for a contingency table.

Note in the results here that one cell has an expected count below 5, but that all expected counts are at least 1, and that cells with expected counts below 5 make up less than 20% of cells (1 / 8 cells = 12.5%).

Test = chisq.test(Matrix)

Test$expected

Pass Fail

Bloom 17.62222 8.377778

Cobblestone 11.52222 5.477778

Dougal 10.16667 4.833333

Heimlich 21.68889 10.311111
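The expected counts above follow from the row totals, column totals, and grand total of the observed table. As a worked check, write R_i for the total of row i, C_j for the total of column j, and N for the grand total. This notation is introduced here for illustration. In these data, 61 students passed and 29 failed, for 90 students in all.

```latex
E_{ij} = \frac{R_i \, C_j}{N}
\qquad
E_{\text{Bloom, Pass}} = \frac{26 \times 61}{90} \approx 17.62
\qquad
E_{\text{Dougal, Fail}} = \frac{15 \times 29}{90} \approx 4.83
```

The Dougal–Fail cell is the one whose expected count falls below 5.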

#### Effect size

See the chapter *Measures of Association for Nominal
Variables* for a discussion of effect size statistics and their
interpretation for contingency tables of nominal variables.

library(rcompanion)

cramerV(Matrix,

digits=3)

Cramer V

0.439

### Note: k = 2 for this table,

### as the minimum categories in one dimension is 2.
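To show where this value comes from, Cramér's V can be computed by hand from the chi-square statistic as sqrt(X-squared / (n * (k - 1))), where n is the total count and k is the smaller of the number of rows and columns. The following is a sketch in base R, not the internals of *cramerV*. It rebuilds the matrix so that it runs on its own, and `suppressWarnings` only silences the low-expected-count warning.

```r
Matrix = matrix(c(21,  5,
                   6, 11,
                   7,  8,
                  27,  5),
                ncol = 2, byrow = TRUE,
                dimnames = list(c("Bloom", "Cobblestone", "Dougal", "Heimlich"),
                                c("Pass", "Fail")))

# Pearson chi-square statistic, without continuity correction
Chi = suppressWarnings(chisq.test(Matrix, correct = FALSE))

n = sum(Matrix)        # total number of observations
k = min(dim(Matrix))   # smaller of the number of rows and columns

V = sqrt(unname(Chi$statistic) / (n * (k - 1)))

round(V, 3)
```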

### Fisher Exact test of association

fisher.test(Matrix)

Fisher's Exact Test for Count Data

p-value = 0.000668

alternative hypothesis: two.sided

#### Post-hoc analysis

Post-hoc analysis can be conducted with pairwise Fisher’s
exact tests. The function *pairwiseNominalIndependence* in the *rcompanion*
package can be used to conduct this analysis.

### Order matrix

Matrix = Matrix[(c("Heimlich", "Bloom", "Dougal",
"Cobblestone")),]

Matrix

### Pairwise tests of association

library(rcompanion)

PT = pairwiseNominalIndependence(Matrix,

compare = "row",

fisher = TRUE,

gtest = FALSE,

chisq = FALSE,

method = "fdr", # see ?p.adjust for options

digits = 3)

PT

Comparison p.Fisher p.adj.Fisher

1 Heimlich : Bloom 0.740000 0.74000

2 Heimlich : Dougal 0.013100 0.02620

3 Heimlich : Cobblestone 0.000994 0.00596

4 Bloom : Dougal 0.037600 0.05640

5 Bloom : Cobblestone 0.003960 0.01190

6 Dougal : Cobblestone 0.720000 0.74000
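The same kind of table can be sketched in base R, which makes the two steps explicit: a Fisher's exact test on each 2 x 2 sub-table, then an adjustment of the resulting *p*-values with *p.adjust*. This is an illustration of the idea and is not necessarily how *pairwiseNominalIndependence* is implemented. The object names are chosen here.

```r
# Counts in the same row order used above
Matrix = matrix(c(27,  5,
                  21,  5,
                   7,  8,
                   6, 11),
                ncol = 2, byrow = TRUE,
                dimnames = list(c("Heimlich", "Bloom", "Dougal", "Cobblestone"),
                                c("Pass", "Fail")))

# All pairs of counties: a 2 x 6 character matrix, one pair per column
Pairs = combn(rownames(Matrix), 2)

# Fisher's exact test on the 2 x 2 table formed by each pair of rows
P.raw = apply(Pairs, 2,
              function(pair) fisher.test(Matrix[pair, ])$p.value)

# Benjamini-Hochberg (FDR) adjustment across the six comparisons
P.adj = p.adjust(P.raw, method = "fdr")

data.frame(Comparison   = paste(Pairs[1, ], ":", Pairs[2, ]),
           p.Fisher     = signif(P.raw, 3),
           p.adj.Fisher = signif(P.adj, 3))
```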

### Compact letter display

library(rcompanion)

cldList(p.adj.Fisher ~ Comparison,

data = PT,

threshold = 0.05)

Group Letter MonoLetter

1 Heimlich a a

2 Bloom ab ab

3 Dougal bc bc

4 Cobblestone c c

The table of adjusted *p*-values can be summarized to a
table of letters indicating which treatments are not significantly different.

| County | Percent passing | Letter |
|---|---|---|
| Heimlich County | 84% | a |
| Bloom County | 81% | ab |
| Dougal County | 47% | bc |
| Cobblestone County | 35% | c |

Counties sharing a letter are not significantly different by Fisher exact test,
with p-values adjusted by FDR method for multiple comparisons (Benjamini–Hochberg
false discovery rate).

This table of letters can also be found using the *pairwiseNominalMatrix*
function along with the *multcompLetters* function in the *multcompView*
package.

### Order matrix

Matrix = Matrix[(c("Heimlich", "Bloom", "Dougal",
"Cobblestone")),]

Matrix

### Pairwise tests of association

library(rcompanion)

PM = pairwiseNominalMatrix(Matrix,

compare = "row",

fisher = TRUE,

gtest = FALSE,

chisq = FALSE,

method = "fdr", # see ?p.adjust for options

digits = 3)

PM

$Test

[1] "Fisher exact test"

$Method

[1] "fdr"

$Adjusted

Heimlich Bloom Dougal Cobblestone

Heimlich 1.00000 0.7400 0.0262 0.00596

Bloom 0.74000 1.0000 0.0564 0.01190

Dougal 0.02620 0.0564 1.0000 0.74000

Cobblestone 0.00596 0.0119 0.7400 1.00000

library(multcompView)

multcompLetters(PM$Adjusted,

compare="<",

threshold=0.05, ### p-value to use as significance threshold

Letters=letters,

reversed = FALSE)

Heimlich Bloom Dougal Cobblestone

"a" "ab" "bc" "c"

### G-test of association

library(DescTools)

GTest(Matrix)

Log likelihood ratio (G-test) test of independence without correction

G = 17.14, X-squared df = 3, p-value = 0.0006615
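The G statistic reported above can be reproduced by hand as 2 times the sum of O * ln(O / E) over all cells, where O is the observed count and E the expected count. This is a sketch in base R with no correction applied. It assumes that no observed count is zero, since a zero would make the logarithm undefined.

```r
Matrix = matrix(c(27,  5,
                  21,  5,
                   7,  8,
                   6, 11),
                ncol = 2, byrow = TRUE,
                dimnames = list(c("Heimlich", "Bloom", "Dougal", "Cobblestone"),
                                c("Pass", "Fail")))

# Expected counts under no association
Expected = outer(rowSums(Matrix), colSums(Matrix)) / sum(Matrix)

# G statistic (assumes all observed counts are greater than zero)
G = 2 * sum(Matrix * log(Matrix / Expected))

# Degrees of freedom: (rows - 1) * (columns - 1)
df = (nrow(Matrix) - 1) * (ncol(Matrix) - 1)

# p-value from the chi-square distribution
p.value = pchisq(G, df = df, lower.tail = FALSE)

c(G = G, df = df, p.value = p.value)
```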

#### Post-hoc analysis

### Order matrix

Matrix = Matrix[(c("Heimlich", "Bloom", "Dougal",
"Cobblestone")),]

Matrix

### Pairwise tests of association

pairwiseNominalIndependence(Matrix,

compare = "row",

fisher = FALSE,

gtest = TRUE,

chisq = FALSE,

method = "fdr", # see ?p.adjust for options

digits = 3)

Comparison p.Gtest p.adj.Gtest

1 Heimlich : Bloom 0.718000 0.71800

2 Heimlich : Dougal 0.008300 0.01660

3 Heimlich : Cobblestone 0.000506 0.00304

4 Bloom : Dougal 0.024800 0.03720

5 Bloom : Cobblestone 0.002380 0.00714

6 Dougal : Cobblestone 0.513000 0.61600

The table of adjusted *p*-values can be summarized to a
table of letters indicating which treatments are not significantly different.

| County | Percent passing | Letter |
|---|---|---|
| Heimlich County | 84% | a |
| Bloom County | 81% | a |
| Dougal County | 47% | b |
| Cobblestone County | 35% | b |

Counties sharing a letter are not significantly different by G-test for association,
with p-values adjusted by FDR method for multiple comparisons
(Benjamini–Hochberg false discovery rate).

### Chi-square test of association

chisq.test(Matrix)

Pearson's Chi-squared test

X-squared = 17.32, df = 3, p-value = 0.0006072

#### Post-hoc analysis

### Order matrix

Matrix = Matrix[(c("Heimlich", "Bloom", "Dougal",
"Cobblestone")),]

Matrix

### Pairwise tests of association

library(rcompanion)

pairwiseNominalIndependence(Matrix,

compare = "row",

fisher = FALSE,

gtest = FALSE,

chisq = TRUE,

method = "fdr", # see ?p.adjust for options

digits = 3)

Comparison p.Chisq p.adj.Chisq

1 Heimlich : Bloom 0.99000 0.99000

2 Heimlich : Dougal 0.01910 0.03820

3 Heimlich : Cobblestone 0.00154 0.00924

4 Bloom : Dougal 0.05590 0.08380

5 Bloom : Cobblestone 0.00707 0.02120

6 Dougal : Cobblestone 0.77000 0.92400

The table of adjusted *p*-values can be summarized to a
table of letters indicating which treatments are not significantly different.

| County | Percent passing | Letter |
|---|---|---|
| Heimlich County | 84% | a |
| Bloom County | 81% | ab |
| Dougal County | 47% | bc |
| Cobblestone County | 35% | c |

Counties sharing a letter are not significantly different by chi-square test of
association, with p-values adjusted by FDR method for multiple comparisons
(Benjamini–Hochberg false discovery rate).

### Optional readings

“Small numbers in chi-square and G–tests” in McDonald, J.H. 2014. *Handbook of Biological Statistics*. www.biostathandbook.com/small.html.


### Exercises M

1. Considering Alexander Anderson’s data,

a. Numerically, which county had the greatest percentage of
passing students?

b. Numerically, which county had the greatest number of total
students?

c. Was there an association between county and student
success? Report the test used, why you chose this test, the *p*-value, and
the conclusion.

d. What was the effect size? What statistic was used? What
is the interpretation of this value?

e. Statistically, which counties performed the best? Be sure to indicate which post-hoc test you are using.

f. Statistically, which performed the worst?

g. Plot the data in an appropriate way, and submit the plot.

h. Practically speaking, what are your conclusions? Consider all the information above that is relevant, and include descriptive statistics and observations about your plot that support your conclusions.

2. Ryuk and Rem held a workshop on planting habitat for pollinators like bees and butterflies. They wish to know if there is an association between the profession of the attendees and their willingness to undertake a conservation planting. The following are the data.

| Profession | Will plant? Yes | Will plant? No |
|---|---|---|
| Homeowner | 13 | 14 |
| Landscaper | 27 | 6 |
| Farmer | 7 | 19 |
| NGO | 6 | 6 |

For each of the following, answer the question, and **show
the output from the analyses you used to answer the question**.

a. Numerically, which profession had the greatest percentage
of answering yes?

b. Numerically, which profession had the greatest number of
total attendees?

c. Was there an association between profession and willingness
to undertake a conservation planting? Report the test used, why you chose this
test, the *p*-value, and the conclusion.

d. What was the effect size? What statistic was used? What
is the interpretation of this value?

e. Statistically, which professions were most willing? Be sure to indicate which post-hoc test you are using.

f. Statistically, which were least willing?

g. Plot the data in an appropriate way, and submit the plot.

h. Practically speaking, what are your conclusions? Consider all the information above that is relevant, and include descriptive statistics and observations about your plot that support your conclusions.