An R Companion for the Handbook of Biological Statistics

Salvatore S. Mangiafico

Chi-square Test of Independence

The Chi-square test of independence can be performed with the chisq.test function in the native stats package in R.  For this test, the function requires the contingency table to be in the form of matrix.  Depending on the form of the data to begin with, this can require an extra step, either combing vectors into a matrix, or cross-tabulating the counts among factors in a data frame.  None of this is too difficult, but it requires following the correct example depending on the initial form of the data.

When using read.table and as.matrix to read a table directly as a matrix, be careful of extra spaces at the end of lines or extraneous characters in the table, as these can cause errors.

Packages used in this chapter

The following commands will install these packages if they are not already installed:

if(!require(rcompanion)){install.packages("rcompanion")}
if(!require(dplyr)){install.packages("dplyr")}
if(!require(ggplot2)){install.packages("ggplot2")}
if(!require(grid)){install.packages("grid")}
if(!require(pwr)){install.packages("pwr")}

When to use it

Example of chi-square test with matrix created with read.table

### --------------------------------------------------------------
### Vaccination example, Chi-square independence, pp. 59
60
###      Example directly reading a table as a matrix
### --------------------------------------------------------------

Input =("
Injection.area  No.severe  Severe
Thigh           4788       30
Arm             8916       76
")

row.names=1))

Matriz

No.severe Severe

Thigh      4788     30

Arm        8916     76

chisq.test(Matriz,
correct=TRUE)      # Continuity correction for 2 x 2
#      table

Pearson's Chi-squared test with Yates' continuity correction

X-squared = 1.7579, df = 1, p-value = 0.1849

chisq.test(Matriz,
correct=FALSE)      # No continuity correction for 2 x 2
#      table

Pearson's Chi-squared test

X-squared = 2.0396, df = 1, p-value = 0.1533

#     #     #

Example of chi-square test with matrix created by combining vectors

### --------------------------------------------------------------
### Vaccination example, Chi-square independence, pp. 59
60
###      Example creating a matrix from vectors
### --------------------------------------------------------------

R1 = c(4788, 30)
R2 = c(8916, 76)

rows   = 2

Matriz = matrix(c(R1, R2),
nrow=rows,
byrow=TRUE)

rownames(Matriz) = c("Thigh", "Arm")          # Naming the rows and
colnames(Matriz) = c("No.severe", "Severe")   #  columns is optional.

Matriz

No.severe Severe

Thigh      4788     30

Arm        8916     76

chisq.test(Matriz,
correct=TRUE)      # Continuity correction for 2 x 2
#      table

Pearson's Chi-squared test with Yates' continuity correction

X-squared = 1.7579, df = 1, p-value = 0.1849

chisq.test(Matriz,
correct=FALSE)      # No continuity correction for 2 x 2
#      table

Pearson's Chi-squared test

X-squared = 2.0396, df = 1, p-value = 0.1533

#     #     #

Null hypothesis

How the test works

See the Handbook for information on these topics.

Post-hoc tests

For the following example of post-hoc pairwise testing, we’ll use the pairwiseNominalIndependence function from the package rcompanion to make the task easier.  Then we’ll use pairwise.table in the native stats package as an alternative.

Post-hoc pairwise chi-square tests with rcompanion

### --------------------------------------------------------------
### Post-hoc example, Chi-square independence, pp. 60
61
### --------------------------------------------------------------

Input =("
Supplement     No.cancer  Cancer
'Selenium'     8177       575
'Vitamin E'    8117       620
'Selenium+E'   8147       555
'Placebo'      8167       529
")

row.names=1))

Matriz

chisq.test(Matriz)

X-squared = 7.7832, df = 3, p-value = 0.05071

library(rcompanion)

pairwiseNominalIndependence(Matriz,
fisher = FALSE,
gtest  = FALSE,
chisq  = TRUE,
method = "fdr")

1   Selenium : Vitamin E 0.17700      0.2960
2  Selenium : Selenium+E 0.62800      0.6280
3     Selenium : Placebo 0.19700      0.2960
4 Vitamin E : Selenium+E 0.06260      0.1880
5    Vitamin E : Placebo 0.00771      0.0463
6   Selenium+E : Placebo 0.44000      0.5280

Post-hoc pairwise chi-square tests with pairwise.table

### --------------------------------------------------------------
### Post-hoc example, Chi-square independence, pp. 60
61
### As is, this code works on a matrix with two columns,
###   and compares rows
### --------------------------------------------------------------

Input =("
Supplement     No.cancer  Cancer
'Selenium'     8177       575
'Vitamin E'    8117       620
'Selenium+E'   8147       555
'Placebo'      8167       529
")

row.names=1))

Matriz

chisq.test(Matriz)

X-squared = 7.7832, df = 3, p-value = 0.05071

FUN = function(i,j){
chisq.test(matrix(c(Matriz[i,1], Matriz[i,2],
Matriz[j,1], Matriz[j,2]),
nrow=2,
byrow=TRUE))\$ p.value
}

pairwise.table(FUN,
rownames(Matriz),

Selenium   Vitamin.E Selenium.and.E

Vitamin.E      0.1772113          NA             NA

Selenium.and.E 0.6277621 0.062588260             NA

Placebo        0.1973435 0.007705529      0.4398677

#     #     #

Assumptions

See the Handbook for information on this topic.

Examples

Chi-square test of independence with continuity correction and without correction

### --------------------------------------------------------------
### Helmet example, Chi-square independence, p. 63
### --------------------------------------------------------------

Input =("
Helemt     372          4715
No.helmet  267          1391
")

row.names=1))

Matriz

chisq.test(Matriz,
correct=TRUE)       # Continuity correction for 2 x 2
#      table

Pearson's Chi-squared test with Yates' continuity correction

X-squared = 111.6569, df = 1, p-value < 2.2e-16

chisq.test(Matriz,
correct=FALSE)      # No continuity correction for 2 x 2
#      table

Pearson's Chi-squared test

X-squared = 112.6796, df = 1, p-value < 2.2e-16

#     #     #

Chi-square test of independence

### --------------------------------------------------------------
### Gardemann apolipoprotein example, Chi-square independence,
###   p. 63
### --------------------------------------------------------------

Input =("
Genotype  No.disease Coronary.disease
'ins/ins'   268        807
'ins/del'   199        759
'del/del'    42        184
")

row.names=1))

Matriz

chisq.test(Matriz)

Pearson's Chi-squared test

X-squared = 7.2594, df = 2, p-value = 0.02652

#     #     #

Graphing the results

The first plot below is a bar char with confidence intervals, with a style typical of the ggplot2 package.  The second plot is somewhat more similar to the style of the plot in the Handbook

For each example, the code calculates proportions or confidence intervals.  This code could be skipped if those values were determined manually and put in to a data frame from which the plot could be generated.

Sometimes factors will need to have the order of their levels specified for ggplot2 to put them in the correct order on the plot.  Otherwise R will alphabetize levels.

Simple bar plot with error bars showing confidence intervals

### --------------------------------------------------------------
### Plot example, herons and egrets, Chi-square test of association,
###   pp. 63
64
### --------------------------------------------------------------

Input =("
Supplement     No.cancer  Cancer
'Selenium'     8177       575
'Vitamin E'    8117       620
'Selenium+E'   8147       555
'Placebo'      8167       529
")

### Add sums and confidence intervals

library(dplyr)

Prostate =
mutate(Prostate,
Sum = No.cancer + Cancer)

Prostate =
mutate(Prostate,
Prop = Cancer / Sum,
low.ci = apply(Prostate[c("Cancer", "Sum")], 1,
function(y) binom.test(y['Cancer'], y['Sum'])\$ conf.int[1]),
high.ci = apply(Prostate[c("Cancer", "Sum")], 1,
function(y) binom.test(y['Cancer'], y['Sum'])\$ conf.int[2])
)

Prostate

Supplement No.cancer Cancer  Sum       Prop     low.ci    high.ci

1   Selenium      8177    575 8752 0.06569927 0.06059677 0.07109314

2  Vitamin E      8117    620 8737 0.07096257 0.06566518 0.07654816

3 Selenium+E      8147    555 8702 0.06377844 0.05873360 0.06911770

4    Placebo      8167    529 8696 0.06083257 0.05589912 0.06606271

### Plot (Bar chart plot)

library(ggplot2)

ggplot(Prostate,
aes(x=Supplement, y=Prop)) +
geom_bar(stat="identity", fill="gray40",
colour="black", size=0.5,
width=0.7) +
geom_errorbar(aes(ymax=high.ci, ymin=low.ci),
width=0.2, size=0.5, color="black") +
xlab("Supplement") +
ylab("Prostate cancer proportion") +
scale_x_discrete(labels=c("Selenium", "Vitamin E",
"Selenium+E","Placebo")) +
## ggtitle("Main title") +
theme(axis.title=element_text(size=14, color="black",
face="bold", vjust=3)) +
theme(axis.text = element_text(size=12, color = "gray25",
face="bold")) +
theme(axis.title.y = element_text(vjust= 1.8)) +
theme(axis.title.x = element_text(vjust= -0.5))

#     #     #

Bar plot of proportions vs. categories.  Error bars indicate 95% confidence intervals for  observed proportion.

Bar plot with categories and no error bars

### --------------------------------------------------------------
### Plot example, herons and egrets, Chi-square independence,
###   p. 64
### --------------------------------------------------------------

Input =("
Habitat      Bird   Count
Vegetation   Heron   15
Shoreline    Heron   20
Water        Heron   14
Structures   Heron    6
Vegetation   Egret    8
Shoreline    Egret    5
Water        Egret    7
Structures   Egret    1
")

### Specify the order of factor levels

library(dplyr)

Birds=
mutate(Birds,
Habitat = factor(Habitat,levels=unique(Habitat)),
Bird = factor(Bird,levels=unique(Bird))
)

Birds\$ Sum[Birds\$ Bird == 'Heron'] =
sum(Birds\$ Count[Birds\$ Bird == 'Heron'])

Birds\$ Sum[Birds\$ Bird == 'Egret'] =
sum(Birds\$ Count[Birds\$ Bird == 'Egret'])

Birds=
mutate(Birds,
prop = Count / Sum
)

Birds

Habitat  Bird Count Sum       prop

1 Vegetation Heron    15  55 0.27272727

2  Shoreline Heron    20  55 0.36363636

3      Water Heron    14  55 0.25454545

4 Structures Heron     6  55 0.10909091

5 Vegetation Egret     8  21 0.38095238

6  Shoreline Egret     5  21 0.23809524

7      Water Egret     7  21 0.33333333

8 Structures Egret     1  21 0.04761905

### shinyapps.stat.ubc.ca/r-graph-catalog/

library(ggplot2)
library(grid)

ggplot(Birds,
aes(x = Habitat, y = prop, fill = Bird, ymax=0.40, ymin=0)) +
geom_bar(stat="identity", position = "dodge", width = 0.7) +
geom_bar(stat="identity", position = "dodge", colour = "black",
width = 0.7, show_guide = FALSE) +
scale_y_continuous(breaks = seq(0, 0.40, 0.05),
limits = c(0, 0.40),
expand = c(0, 0)) +
scale_fill_manual(name = "Bird type" ,
values = c('grey80', 'grey30'),
labels = c("Heron (all types)",
"Egret (all types)")) +
## geom_errorbar(position=position_dodge(width=0.7),
##               width=0.0, size=0.5, color="black") +
labs(x = "Habitat Location", y = "Landing site proportion") +
## ggtitle("Main title") +
theme_bw() +
theme(panel.grid.major.x = element_blank(),
panel.grid.major.y = element_line(colour = "grey50"),
plot.title = element_text(size = rel(1.5),
face = "bold", vjust = 1.5),
axis.title = element_text(face = "bold"),
legend.position = "top",
legend.title = element_blank(),
legend.key.size = unit(0.4, "cm"),
legend.key = element_rect(fill = "black"),
axis.title.y = element_text(vjust= 1.8),
axis.title.x = element_text(vjust= -0.5)
)

#     #     #

Similar tests

Chi-square vs. G–test

See the Handbook for information on these topics.  Fisher’s exact test, G-test, and McNemar’s test are discussed elsewhere in this book.

How to do the test

Chi-square test of independence with data as a data frame

In the following example for the chi-square test of independence, the data is read in as a data frame, not as a matrix as in previous examples.  This allows more flexibility with how data are entered.  For example you could have counts for same genotype and health distributed among several lines, or have a count of 1 for each row, with a separate row for each individual observation.  The xtabs function is used to tabulate the data and convert them to a contingency table.

### --------------------------------------------------------------
### Gardemann apolipoprotein example, Chi-square independence,
###      SAS example, pp. 65
66
###      Example using cross-tabulation
### --------------------------------------------------------------

Input =("
Genotype  Health       Count
ins-ins   no_disease   268
ins-ins   disease      807
ins-del   no_disease   199
ins-del   disease      759
del-del   no_disease    42
del-del   disease      184
")

###  Cross-tabulate the data

Data.xtabs = xtabs(Count ~ Genotype + Health,
data=Data.frame)

Data.xtabs

Health

Genotype  disease no_disease

del-del     184         42

ins-del     759        199

ins-ins     807        268

summary(Data.xtabs)                # includes N and factors

Number of cases in table: 2259

Number of factors: 2

###  Chi-square test of independence

chisq.test(Data.xtabs)

X-squared = 7.2594, df = 2, p-value = 0.02652

#     #     #

Power analysis

Power analysis for chi-square test of independence

### --------------------------------------------------------------
### Power analysis, chi-square independence, pp. 66
67
### --------------------------------------------------------------

# This example assumes you are using a Chi-square test of
#   independence.  The example in the Handbook appears to use
#   a Chi-square goodness-of-fit test

# In the pwr package, for the Chi-square test of independence,
#   the table probabilities should sum to 1

Input =("
Genotype  No.cancer Cancer
GG        0.18      0.165
GA        0.24      0.225
AA        0.08      0.110
")

row.names=1))

P

No.cancer Cancer

GG      0.18  0.165

GA      0.24  0.225

AA      0.08  0.110

sum(P)        # Sum of values in the P matrix

[1] 1

library(pwr)

effect.size = ES.w2(P)

degrees = (nrow(P)-1)*(ncol(P)-1)  # Calculate degrees of freedom

pwr.chisq.test(
w=effect.size,
N=NULL,            # Total number of observations
df=degrees,
power=0.80,        # 1 minus Type II probability
sig.level=0.05)    # Type I probability

w = 0.07663476  # Answer differs significantly

N = 1640.537    #   from Handbook

df = 2           # Total observations

sig.level = 0.05

power = 0.8

#     #     #