The Chi-square test of independence can be performed with the chisq.test function in the native stats package in R. For this test, the function requires the contingency table to be in the form of matrix. Depending on the form of the data to begin with, this can require an extra step, either combing vectors into a matrix, or cross-tabulating the counts among factors in a data frame. None of this is too difficult, but it requires following the correct example depending on the initial form of the data.
When using read.table and as.matrix to read a table directly as a matrix, be careful of extra spaces at the end of lines or extraneous characters in the table, as these can cause errors.
Examples in Summary and Analysis of Extension
Program Evaluation
SAEPER: Association Tests for Nominal Variables
Packages used in this chapter
The following commands will install these packages if they are not already installed:
if(!require(rcompanion)){install.packages("rcompanion")}
if(!require(dplyr)){install.packages("dplyr")}
if(!require(ggplot2)){install.packages("ggplot2")}
if(!require(grid)){install.packages("grid")}
if(!require(pwr)){install.packages("pwr")}
When to use it
Example of chi-square test with matrix created with read.table
###
--------------------------------------------------------------
### Vaccination example, Chi-square independence, pp. 59–60
### Example directly reading a table as a matrix
### --------------------------------------------------------------
Input =("
Injection.area No.severe Severe
Thigh 4788 30
Arm 8916 76
")
Matriz = as.matrix(read.table(textConnection(Input),
header=TRUE,
row.names=1))
Matriz
No.severe Severe
Thigh 4788 30
Arm 8916 76
chisq.test(Matriz,
correct=TRUE) # Continuity
correction for 2 x 2
# table
Pearson's Chi-squared test with Yates' continuity correction
X-squared = 1.7579, df = 1, p-value = 0.1849
chisq.test(Matriz,
correct=FALSE) # No continuity
correction for 2 x 2
# table
Pearson's Chi-squared test
X-squared = 2.0396, df = 1, p-value = 0.1533
# # #
Example of chi-square test with matrix created by combining vectors
###
--------------------------------------------------------------
### Vaccination example, Chi-square independence, pp. 59–60
### Example creating a matrix from vectors
### --------------------------------------------------------------
R1 = c(4788, 30)
R2 = c(8916, 76)
rows = 2
Matriz = matrix(c(R1, R2),
nrow=rows,
byrow=TRUE)
rownames(Matriz) = c("Thigh", "Arm") # Naming the rows and
colnames(Matriz) = c("No.severe", "Severe") # columns is optional.
Matriz
No.severe Severe
Thigh 4788 30
Arm 8916 76
chisq.test(Matriz,
correct=TRUE) # Continuity
correction for 2 x 2
# table
Pearson's Chi-squared test with Yates' continuity correction
X-squared = 1.7579, df = 1, p-value = 0.1849
chisq.test(Matriz,
correct=FALSE) # No continuity
correction for 2 x 2
# table
Pearson's Chi-squared test
X-squared = 2.0396, df = 1, p-value = 0.1533
# # #
See the Handbook for information on these topics.
Post-hoc tests
For the following example of post-hoc pairwise testing, we’ll use the pairwiseNominalIndependence function from the package rcompanion to make the task easier. Then we’ll use pairwise.table in the native stats package as an alternative.
Post-hoc pairwise chi-square tests with
rcompanion
###
--------------------------------------------------------------
### Post-hoc example, Chi-square independence, pp. 60–61
### --------------------------------------------------------------
Input =("
Supplement No.cancer Cancer
'Selenium' 8177 575
'Vitamin E' 8117 620
'Selenium+E' 8147 555
'Placebo' 8167 529
")
Matriz = as.matrix(read.table(textConnection(Input),
header=TRUE,
row.names=1))
Matriz
chisq.test(Matriz)
X-squared = 7.7832, df = 3, p-value = 0.05071
library(rcompanion)
pairwiseNominalIndependence(Matriz,
fisher = FALSE,
gtest = FALSE,
chisq = TRUE,
method = "fdr")
Comparison p.Chisq
p.adj.Chisq
1 Selenium :
Vitamin E 0.17700 0.2960
2
Selenium : Selenium+E 0.62800
0.6280
3 Selenium :
Placebo 0.19700 0.2960
4
Vitamin E : Selenium+E 0.06260
0.1880
5 Vitamin E :
Placebo 0.00771 0.0463
6
Selenium+E : Placebo 0.44000
0.5280
Post-hoc pairwise chi-square tests with pairwise.table
###
--------------------------------------------------------------
### Post-hoc example, Chi-square independence, pp. 60–61
### As is, this code works on a matrix with two columns,
### and compares rows
### --------------------------------------------------------------
Input =("
Supplement No.cancer Cancer
'Selenium' 8177 575
'Vitamin E' 8117 620
'Selenium+E' 8147 555
'Placebo' 8167 529
")
Matriz = as.matrix(read.table(textConnection(Input),
header=TRUE,
row.names=1))
Matriz
chisq.test(Matriz)
X-squared = 7.7832, df = 3, p-value = 0.05071
FUN = function(i,j){
chisq.test(matrix(c(Matriz[i,1], Matriz[i,2],
Matriz[j,1], Matriz[j,2]),
nrow=2,
byrow=TRUE))$ p.value
}
pairwise.table(FUN,
rownames(Matriz),
p.adjust.method="none")
# Can adjust p-values;
# see ?p.adjust for options
Selenium Vitamin.E Selenium.and.E
Vitamin.E 0.1772113 NA NA
Selenium.and.E 0.6277621 0.062588260 NA
Placebo 0.1973435 0.007705529 0.4398677
# # #
See the Handbook for information on this topic.
Examples
Chi-square test of independence with continuity correction and without correction
###
--------------------------------------------------------------
### Helmet example, Chi-square independence, p. 63
### --------------------------------------------------------------
Input =("
PSE Head.injury Other.injury
Helemt 372 4715
No.helmet 267 1391
")
Matriz = as.matrix(read.table(textConnection(Input),
header=TRUE,
row.names=1))
Matriz
chisq.test(Matriz,
correct=TRUE) # Continuity
correction for 2 x 2
# table
Pearson's Chi-squared test with Yates' continuity correction
X-squared = 111.6569, df = 1, p-value < 2.2e-16
chisq.test(Matriz,
correct=FALSE) # No continuity
correction for 2 x 2
# table
Pearson's Chi-squared test
X-squared = 112.6796, df = 1, p-value < 2.2e-16
# # #
Chi-square test of independence
### --------------------------------------------------------------
### Gardemann apolipoprotein example, Chi-square independence,
### p. 63
### --------------------------------------------------------------
Input =("
Genotype No.disease Coronary.disease
'ins/ins' 268 807
'ins/del' 199 759
'del/del' 42 184
")
Matriz = as.matrix(read.table(textConnection(Input),
header=TRUE,
row.names=1))
Matriz
chisq.test(Matriz)
Pearson's Chi-squared test
X-squared = 7.2594, df = 2, p-value = 0.02652
# # #
Graphing the results
The first plot below is a bar char with confidence intervals, with a style typical of the ggplot2 package. The second plot is somewhat more similar to the style of the plot in the Handbook.
For each example, the code calculates proportions or confidence intervals. This code could be skipped if those values were determined manually and put in to a data frame from which the plot could be generated.
Sometimes factors will need to have the order of their levels specified for ggplot2 to put them in the correct order on the plot. Otherwise R will alphabetize levels.
Simple bar plot with error bars showing confidence intervals
### --------------------------------------------------------------
### Plot example, herons and egrets, Chi-square test of association,
### pp. 63–64
### --------------------------------------------------------------
Input =("
Supplement No.cancer Cancer
'Selenium' 8177 575
'Vitamin E' 8117 620
'Selenium+E' 8147 555
'Placebo' 8167 529
")
Prostate = read.table(textConnection(Input),header=TRUE)
### Add sums and confidence intervals
library(dplyr)
Prostate =
mutate(Prostate,
Sum = No.cancer + Cancer)
Prostate =
mutate(Prostate,
Prop = Cancer / Sum,
low.ci = apply(Prostate[c("Cancer", "Sum")], 1,
function(y) binom.test(y['Cancer'], y['Sum'])$ conf.int[1]),
high.ci = apply(Prostate[c("Cancer", "Sum")], 1,
function(y) binom.test(y['Cancer'], y['Sum'])$ conf.int[2])
)
Prostate
Supplement No.cancer Cancer Sum Prop low.ci high.ci
1 Selenium 8177 575 8752 0.06569927 0.06059677 0.07109314
2 Vitamin E 8117 620 8737 0.07096257 0.06566518 0.07654816
3 Selenium+E 8147 555 8702 0.06377844 0.05873360 0.06911770
4 Placebo 8167 529 8696 0.06083257 0.05589912 0.06606271
### Plot (Bar chart plot)
library(ggplot2)
ggplot(Prostate,
aes(x=Supplement, y=Prop)) +
geom_bar(stat="identity", fill="gray40",
colour="black", size=0.5,
width=0.7) +
geom_errorbar(aes(ymax=high.ci, ymin=low.ci),
width=0.2, size=0.5, color="black") +
xlab("Supplement") +
ylab("Prostate cancer proportion") +
scale_x_discrete(labels=c("Selenium", "Vitamin E",
"Selenium+E","Placebo")) +
## ggtitle("Main title") +
theme(axis.title=element_text(size=14, color="black",
face="bold", vjust=3)) +
theme(axis.text = element_text(size=12, color = "gray25",
face="bold")) +
theme(axis.title.y = element_text(vjust= 1.8)) +
theme(axis.title.x = element_text(vjust= -0.5))
# # #
Bar plot of proportions vs. categories. Error bars indicate 95% confidence intervals for observed proportion.
Bar plot with categories and no error bars
### --------------------------------------------------------------
### Plot example, herons and egrets, Chi-square independence,
### p. 64
### --------------------------------------------------------------
Input =("
Habitat Bird Count
Vegetation Heron 15
Shoreline Heron 20
Water Heron 14
Structures Heron 6
Vegetation Egret 8
Shoreline Egret 5
Water Egret 7
Structures Egret 1
")
Birds = read.table(textConnection(Input),header=TRUE)
### Specify the order of factor levels
library(dplyr)
Birds=
mutate(Birds,
Habitat = factor(Habitat,levels=unique(Habitat)),
Bird = factor(Bird,levels=unique(Bird))
)
### Add sums and proportions
Birds$ Sum[Birds$ Bird == 'Heron'] =
sum(Birds$ Count[Birds$ Bird == 'Heron'])
Birds$ Sum[Birds$ Bird == 'Egret'] =
sum(Birds$ Count[Birds$ Bird == 'Egret'])
Birds=
mutate(Birds,
prop = Count / Sum
)
Birds
Habitat Bird Count Sum prop
1 Vegetation Heron 15 55 0.27272727
2 Shoreline Heron 20 55 0.36363636
3 Water Heron 14 55 0.25454545
4 Structures Heron 6 55 0.10909091
5 Vegetation Egret 8 21 0.38095238
6 Shoreline Egret 5 21 0.23809524
7 Water Egret 7 21 0.33333333
8 Structures Egret 1 21 0.04761905
### Plot adapted from:
### shinyapps.stat.ubc.ca/r-graph-catalog/
library(ggplot2)
library(grid)
ggplot(Birds,
aes(x = Habitat, y = prop, fill = Bird, ymax=0.40, ymin=0)) +
geom_bar(stat="identity", position = "dodge", width =
0.7) +
geom_bar(stat="identity", position = "dodge", colour =
"black",
width = 0.7, show_guide = FALSE) +
scale_y_continuous(breaks = seq(0, 0.40, 0.05),
limits = c(0, 0.40),
expand = c(0, 0)) +
scale_fill_manual(name = "Bird type" ,
values = c('grey80', 'grey30'),
labels = c("Heron (all types)",
"Egret (all types)")) +
## geom_errorbar(position=position_dodge(width=0.7),
## width=0.0, size=0.5, color="black") +
labs(x = "Habitat Location", y = "Landing site
proportion") +
## ggtitle("Main title") +
theme_bw() +
theme(panel.grid.major.x = element_blank(),
panel.grid.major.y = element_line(colour = "grey50"),
plot.title = element_text(size = rel(1.5),
face = "bold", vjust = 1.5),
axis.title = element_text(face = "bold"),
legend.position = "top",
legend.title = element_blank(),
legend.key.size = unit(0.4, "cm"),
legend.key = element_rect(fill = "black"),
axis.title.y = element_text(vjust= 1.8),
axis.title.x = element_text(vjust= -0.5)
)
# # #
See the Handbook for information on these topics. Fisher’s exact test, G-test, and McNemar’s test are discussed elsewhere in this book.
How to do the test
Chi-square test of independence with data as a data frame
In the following example for the chi-square test of independence, the data is read in as a data frame, not as a matrix as in previous examples. This allows more flexibility with how data are entered. For example you could have counts for same genotype and health distributed among several lines, or have a count of 1 for each row, with a separate row for each individual observation. The xtabs function is used to tabulate the data and convert them to a contingency table.
### --------------------------------------------------------------
### Gardemann apolipoprotein example, Chi-square independence,
### SAS example, pp. 65–66
### Example using cross-tabulation
### --------------------------------------------------------------
Input =("
Genotype Health Count
ins-ins no_disease 268
ins-ins disease 807
ins-del no_disease 199
ins-del disease 759
del-del no_disease 42
del-del disease 184
")
Data.frame = read.table(textConnection(Input),header=TRUE)
### Cross-tabulate the data
Data.xtabs = xtabs(Count ~ Genotype + Health,
data=Data.frame)
Data.xtabs
Health
Genotype disease no_disease
del-del 184 42
ins-del 759 199
ins-ins 807 268
summary(Data.xtabs) # includes N and factors
Number of cases in table: 2259
Number of factors: 2
### Chi-square test of independence
chisq.test(Data.xtabs)
X-squared = 7.2594, df = 2, p-value = 0.02652
# # #
Power analysis
Power analysis for chi-square test of independence
###
--------------------------------------------------------------
### Power analysis, chi-square independence, pp. 66–67
### --------------------------------------------------------------
# This example assumes you are using a Chi-square
test of
# independence. The example in the Handbook appears to use
# a Chi-square goodness-of-fit test
# In the pwr package, for the Chi-square test of independence,
# the table probabilities should sum to 1
Input =("
Genotype No.cancer Cancer
GG 0.18 0.165
GA 0.24 0.225
AA 0.08 0.110
")
P = as.matrix(read.table(textConnection(Input),
header=TRUE,
row.names=1))
P
No.cancer Cancer
GG 0.18 0.165
GA 0.24 0.225
AA 0.08 0.110
sum(P) # Sum of values in the P matrix
[1] 1
library(pwr)
effect.size = ES.w2(P)
degrees = (nrow(P)-1)*(ncol(P)-1) # Calculate
degrees of freedom
pwr.chisq.test(
w=effect.size,
N=NULL, # Total number of
observations
df=degrees,
power=0.80, # 1 minus Type II
probability
sig.level=0.05) # Type I probability
w = 0.07663476 # Answer differs significantly
N = 1640.537 # from Handbook
df = 2 # Total observations
sig.level = 0.05
power = 0.8
# # #