R Companion: Chi-square Test of Independence

The Chi-square test of independence can be performed with the chisq.test function in the native stats package in R. For this test, the function requires the contingency table to be in the form of matrix. Depending on the form of the data to begin with, this can require an extra step, either combing vectors into a matrix, or cross-tabulating the counts among factors in a data frame. None of this is too difficult, but it requires following the correct example depending on the initial form of the data.

When using read.table and as.matrix to read a table directly as a matrix, be careful of extra spaces at the end of lines or extraneous characters in the table, as these can cause errors.

Examples in Summary and Analysis of Extension Program Evaluation

SAEPER: Association Tests for Nominal Variables

Packages used in this chapter

The following commands will install these packages if they are not already installed:

if(!require(rcompanion)){install.packages("rcompanion")}
if(!require(dplyr)){install.packages("dplyr")}
if(!require(ggplot2)){install.packages("ggplot2")}
if(!require(grid)){install.packages("grid")}
if(!require(pwr)){install.packages("pwr")}

When to use it

Example of chi-square test with matrix created with read.table

### --------------------------------------------------------------
### Vaccination example, Chi-square independence, pp. 59–60
###      Example directly reading a table as a matrix
### --------------------------------------------------------------

Input =("
Injection.area No.severe Severe
Thigh           4788       30
Arm             8916       76
")

Matriz = as.matrix(read.table(textConnection(Input),
                   header=TRUE,
                   row.names=1))

Matriz

No.severe Severe

Thigh 4788 30

Arm 8916 76

chisq.test(Matriz,
correct=TRUE) # Continuity correction for 2 x 2
# table

Pearson's Chi-squared test with Yates' continuity correction

X-squared = 1.7579, df = 1, p-value = 0.1849

chisq.test(Matriz,
correct=FALSE) # No continuity correction for 2 x 2
# table

Pearson's Chi-squared test

X-squared = 2.0396, df = 1, p-value = 0.1533

# # #

Example of chi-square test with matrix created by combining vectors

### --------------------------------------------------------------
### Vaccination example, Chi-square independence, pp. 59–60
###      Example creating a matrix from vectors
### --------------------------------------------------------------

R1 = c(4788, 30)
R2 = c(8916, 76)

rows   = 2

Matriz = matrix(c(R1, R2),
              nrow=rows,
                byrow=TRUE)

rownames(Matriz) = c("Thigh", "Arm")          # Naming the rows and
colnames(Matriz) = c("No.severe", "Severe")   # columns is optional.

Matriz

No.severe Severe

Thigh 4788 30

Arm 8916 76

chisq.test(Matriz,
correct=TRUE) # Continuity correction for 2 x 2
# table

Pearson's Chi-squared test with Yates' continuity correction

X-squared = 1.7579, df = 1, p-value = 0.1849

chisq.test(Matriz,
correct=FALSE) # No continuity correction for 2 x 2
# table

Pearson's Chi-squared test

X-squared = 2.0396, df = 1, p-value = 0.1533

# # #

Null hypothesis

How the test works

See the Handbook for information on these topics.

Post-hoc tests

For the following example of post-hoc pairwise testing, we’ll use the pairwiseNominalIndependence function from the package rcompanion to make the task easier. Then we’ll use pairwise.table in the native stats package as an alternative.

Post-hoc pairwise chi-square tests with rcompanion

### --------------------------------------------------------------
### Post-hoc example, Chi-square independence, pp. 60–61
### --------------------------------------------------------------

Input =("
Supplement     No.cancer Cancer
'Selenium'     8177       575
'Vitamin E'    8117       620
'Selenium+E'   8147       555
'Placebo'      8167       529
")

Matriz = as.matrix(read.table(textConnection(Input),
                   header=TRUE,
                   row.names=1))

Matriz

chisq.test(Matriz)

X-squared = 7.7832, df = 3, p-value = 0.05071

library(rcompanion)

pairwiseNominalIndependence(Matriz,
                            fisher = FALSE,
                            gtest = FALSE,
                            chisq = TRUE,
                            method = "fdr")

              Comparison p.Chisq p.adj.Chisq
1   Selenium : Vitamin E 0.17700      0.2960
2 Selenium : Selenium+E 0.62800      0.6280
3     Selenium : Placebo 0.19700     0.2960
4 Vitamin E : Selenium+E 0.06260      0.1880
5    Vitamin E : Placebo 0.00771      0.0463
6   Selenium+E : Placebo 0.44000      0.5280

Post-hoc pairwise chi-square tests with pairwise.table

### --------------------------------------------------------------
### Post-hoc example, Chi-square independence, pp. 60–61
### As is, this code works on a matrix with two columns,
###   and compares rows
### --------------------------------------------------------------

Input =("
Supplement     No.cancer Cancer
'Selenium'     8177       575
'Vitamin E'    8117       620
'Selenium+E'   8147       555
'Placebo'      8167       529
")

Matriz = as.matrix(read.table(textConnection(Input),
                   header=TRUE,
                   row.names=1))

Matriz

chisq.test(Matriz)

X-squared = 7.7832, df = 3, p-value = 0.05071

FUN = function(i,j){
      chisq.test(matrix(c(Matriz[i,1], Matriz[i,2],
                          Matriz[j,1], Matriz[j,2]),
                 nrow=2,
                 byrow=TRUE))$ p.value
                }

pairwise.table(FUN,
               rownames(Matriz),
               p.adjust.method="none")

                           # Can adjust p-values;
                           # see ?p.adjust for options

Selenium Vitamin.E Selenium.and.E

Vitamin.E 0.1772113 NA NA

Selenium.and.E 0.6277621 0.062588260 NA

Placebo 0.1973435 0.007705529 0.4398677

# # #

Assumptions

See the Handbook for information on this topic.

Examples

Chi-square test of independence with continuity correction and without correction

### --------------------------------------------------------------
### Helmet example, Chi-square independence, p. 63
### --------------------------------------------------------------

Input =("
PSE       Head.injury Other.injury
Helemt     372          4715
No.helmet 267          1391
")

Matriz = as.matrix(read.table(textConnection(Input),
                   header=TRUE,
                   row.names=1))

Matriz

chisq.test(Matriz,
           correct=TRUE)       # Continuity correction for 2 x 2
                               #      table

Pearson's Chi-squared test with Yates' continuity correction

X-squared = 111.6569, df = 1, p-value < 2.2e-16

chisq.test(Matriz,
correct=FALSE) # No continuity correction for 2 x 2
# table

Pearson's Chi-squared test

X-squared = 112.6796, df = 1, p-value < 2.2e-16

# # #

Chi-square test of independence

### --------------------------------------------------------------
### Gardemann apolipoprotein example, Chi-square independence,
###   p. 63
### --------------------------------------------------------------

Input =("
Genotype No.disease Coronary.disease
'ins/ins'   268        807
'ins/del'   199        759
'del/del'    42        184
")

Matriz = as.matrix(read.table(textConnection(Input),
                   header=TRUE,
                   row.names=1))

Matriz

chisq.test(Matriz)

Pearson's Chi-squared test

X-squared = 7.2594, df = 2, p-value = 0.02652

# # #

Graphing the results

The first plot below is a bar char with confidence intervals, with a style typical of the ggplot2 package. The second plot is somewhat more similar to the style of the plot in the Handbook.

For each example, the code calculates proportions or confidence intervals. This code could be skipped if those values were determined manually and put in to a data frame from which the plot could be generated.

Sometimes factors will need to have the order of their levels specified for ggplot2 to put them in the correct order on the plot. Otherwise R will alphabetize levels.

Simple bar plot with error bars showing confidence intervals

### --------------------------------------------------------------
### Plot example, herons and egrets, Chi-square test of association,
###   pp. 63–64
### --------------------------------------------------------------

Input =("
Supplement     No.cancer Cancer
'Selenium'     8177       575
'Vitamin E'    8117       620
'Selenium+E'   8147       555
'Placebo'      8167       529
")

Prostate = read.table(textConnection(Input),header=TRUE)

### Add sums and confidence intervals

library(dplyr)

Prostate =
mutate(Prostate,
       Sum = No.cancer + Cancer)

Prostate =
mutate(Prostate,
       Prop = Cancer / Sum,
       low.ci = apply(Prostate[c("Cancer", "Sum")], 1,
                function(y) binom.test(y['Cancer'], y['Sum'])$ conf.int[1]),
       high.ci = apply(Prostate[c("Cancer", "Sum")], 1,
                function(y) binom.test(y['Cancer'], y['Sum'])$ conf.int[2])
       )

Prostate

Supplement No.cancer Cancer Sum Prop low.ci high.ci

1 Selenium 8177 575 8752 0.06569927 0.06059677 0.07109314

2 Vitamin E 8117 620 8737 0.07096257 0.06566518 0.07654816

3 Selenium+E 8147 555 8702 0.06377844 0.05873360 0.06911770

4 Placebo 8167 529 8696 0.06083257 0.05589912 0.06606271

### Plot (Bar chart plot)

library(ggplot2)

ggplot(Prostate,
aes(x=Supplement, y=Prop)) +
geom_bar(stat="identity", fill="gray40",
          colour="black", size=0.5,
          width=0.7) +
geom_errorbar(aes(ymax=high.ci, ymin=low.ci),
                   width=0.2, size=0.5, color="black") +
xlab("Supplement") +
ylab("Prostate cancer proportion") +
scale_x_discrete(labels=c("Selenium", "Vitamin E",
                           "Selenium+E","Placebo")) +
## ggtitle("Main title") +
theme(axis.title=element_text(size=14, color="black",
                               face="bold", vjust=3)) +
theme(axis.text = element_text(size=12, color = "gray25",
                               face="bold")) +
theme(axis.title.y = element_text(vjust= 1.8)) +
theme(axis.title.x = element_text(vjust= -0.5))

# # #

Rplot

Bar plot of proportions vs. categories. Error bars indicate 95% confidence intervals for observed proportion.

Bar plot with categories and no error bars

### --------------------------------------------------------------
### Plot example, herons and egrets, Chi-square independence,
###   p. 64
### --------------------------------------------------------------

Input =("
Habitat      Bird   Count
Vegetation   Heron   15
Shoreline    Heron   20
Water        Heron   14
Structures   Heron    6
Vegetation   Egret    8
Shoreline    Egret    5
Water        Egret    7
Structures   Egret    1
")

Birds = read.table(textConnection(Input),header=TRUE)

### Specify the order of factor levels

library(dplyr)

Birds=
mutate(Birds,
       Habitat = factor(Habitat,levels=unique(Habitat)),
       Bird = factor(Bird,levels=unique(Bird))
       )


### Add sums and proportions

Birds$ Sum[Birds$ Bird == 'Heron'] =
       sum(Birds$ Count[Birds$ Bird == 'Heron'])

Birds$ Sum[Birds$ Bird == 'Egret'] =
       sum(Birds$ Count[Birds$ Bird == 'Egret'])

Birds=
mutate(Birds,
       prop = Count / Sum
       )

Birds

Habitat Bird Count Sum prop

1 Vegetation Heron 15 55 0.27272727

2 Shoreline Heron 20 55 0.36363636

3 Water Heron 14 55 0.25454545

4 Structures Heron 6 55 0.10909091

5 Vegetation Egret 8 21 0.38095238

6 Shoreline Egret 5 21 0.23809524

7 Water Egret 7 21 0.33333333

8 Structures Egret 1 21 0.04761905

### Plot adapted from:
### shinyapps.stat.ubc.ca/r-graph-catalog/

library(ggplot2)
library(grid)

ggplot(Birds,
aes(x = Habitat, y = prop, fill = Bird, ymax=0.40, ymin=0)) +
geom_bar(stat="identity", position = "dodge", width = 0.7) +
geom_bar(stat="identity", position = "dodge", colour = "black",
           width = 0.7, show_guide = FALSE) +
scale_y_continuous(breaks = seq(0, 0.40, 0.05),
                     limits = c(0, 0.40),
                     expand = c(0, 0)) +
scale_fill_manual(name = "Bird type" ,
                    values = c('grey80', 'grey30'),
                    labels = c("Heron (all types)",
                               "Egret (all types)")) +
## geom_errorbar(position=position_dodge(width=0.7),
##               width=0.0, size=0.5, color="black") +
labs(x = "Habitat Location", y = "Landing site proportion") +
## ggtitle("Main title") +
theme_bw() +
theme(panel.grid.major.x = element_blank(),
        panel.grid.major.y = element_line(colour = "grey50"),
        plot.title = element_text(size = rel(1.5),
                                  face = "bold", vjust = 1.5),
        axis.title = element_text(face = "bold"),
        legend.position = "top",
        legend.title = element_blank(),
        legend.key.size = unit(0.4, "cm"),
        legend.key = element_rect(fill = "black"),
        axis.title.y = element_text(vjust= 1.8),
        axis.title.x = element_text(vjust= -0.5)
       )

# # #

Rplot

Similar tests

Chi-square vs. G–test

See the Handbook for information on these topics. Fisher’s exact test, G-test, and McNemar’s test are discussed elsewhere in this book.

How to do the test

Chi-square test of independence with data as a data frame

In the following example for the chi-square test of independence, the data is read in as a data frame, not as a matrix as in previous examples. This allows more flexibility with how data are entered. For example you could have counts for same genotype and health distributed among several lines, or have a count of 1 for each row, with a separate row for each individual observation. The xtabs function is used to tabulate the data and convert them to a contingency table.

### --------------------------------------------------------------
### Gardemann apolipoprotein example, Chi-square independence,
###      SAS example, pp. 65–66
###      Example using cross-tabulation
### --------------------------------------------------------------

Input =("
Genotype Health       Count
ins-ins   no_disease   268
ins-ins   disease      807
ins-del   no_disease   199
ins-del   disease      759
del-del   no_disease    42
del-del   disease      184
")

Data.frame = read.table(textConnection(Input),header=TRUE)

### Cross-tabulate the data

Data.xtabs = xtabs(Count ~ Genotype + Health,
                   data=Data.frame)

Data.xtabs

Health

Genotype disease no_disease

del-del 184 42

ins-del 759 199

ins-ins 807 268

summary(Data.xtabs) # includes N and factors

Number of cases in table: 2259

Number of factors: 2

### Chi-square test of independence

chisq.test(Data.xtabs)

X-squared = 7.2594, df = 2, p-value = 0.02652

# # #

Power analysis

Power analysis for chi-square test of independence

### --------------------------------------------------------------
### Power analysis, chi-square independence, pp. 66–67
### --------------------------------------------------------------

# This example assumes you are using a Chi-square test of
#   independence. The example in the Handbook appears to use
#   a Chi-square goodness-of-fit test

# In the pwr package, for the Chi-square test of independence,
#   the table probabilities should sum to 1

Input =("
Genotype No.cancer Cancer
GG        0.18     0.165
GA        0.24     0.225
AA        0.08     0.110
")

P = as.matrix(read.table(textConnection(Input),
              header=TRUE,
              row.names=1))

P

No.cancer Cancer

GG 0.18 0.165

GA 0.24 0.225

AA 0.08 0.110

sum(P) # Sum of values in the P matrix

[1] 1

library(pwr)

effect.size = ES.w2(P)

degrees = (nrow(P)-1)*(ncol(P)-1) # Calculate degrees of freedom

pwr.chisq.test(
       w=effect.size,
       N=NULL,          # Total number of observations
       df=degrees,
       power=0.80,      # 1 minus Type II probability
       sig.level=0.05)    # Type I probability

w = 0.07663476 # Answer differs significantly

N = 1640.537 # from Handbook

df = 2 # Total observations

sig.level = 0.05

power = 0.8

# # #

An R Companion for the Handbook of Biological Statistics

Chi-square Test of Independence

Example of chi-square test with matrix created with read.table

Post-hoc pairwise chi-square tests with rcompanion

Post-hoc pairwise chi-square tests with pairwise.table