[banner]

An R Companion for the Handbook of Biological Statistics

Salvatore S. Mangiafico

Chi-square Test of Independence

The Chi-square test of independence can be performed with the chisq.test function in the native stats package in R.  For this test, the function requires the contingency table to be in the form of matrix.  Depending on the form of the data to begin with, this can require an extra step, either combing vectors into a matrix, or cross-tabulating the counts among factors in a data frame.  None of this is too difficult, but it requires following the correct example depending on the initial form of the data. 

 

When using read.table and as.matrix to read a table directly as a matrix, be careful of extra spaces at the end of lines or extraneous characters in the table, as these can cause errors.

 

Examples in Summary and Analysis of Extension Program Evaluation

SAEPER: Association Tests for Nominal Variables

 

Packages used in this chapter

The following commands will install these packages if they are not already installed:


if(!require(rcompanion)){install.packages("rcompanion")}
if(!require(dplyr)){install.packages("dplyr")}
if(!require(ggplot2)){install.packages("ggplot2")}
if(!require(grid)){install.packages("grid")}
if(!require(pwr)){install.packages("pwr")}

When to use it

Example of chi-square test with matrix created with read.table

 

### --------------------------------------------------------------
### Vaccination example, Chi-square independence, pp. 59
60
###      Example directly reading a table as a matrix
### --------------------------------------------------------------

Input =("
Injection.area  No.severe  Severe
Thigh           4788       30
Arm             8916       76
")

Matriz = as.matrix(read.table(textConnection(Input),
                   header=TRUE,
                   row.names=1))

Matriz  

 

           No.severe Severe

Thigh      4788     30

Arm        8916     76

 

 

chisq.test(Matriz,
           correct=TRUE)      # Continuity correction for 2 x 2
                              #      table

 

Pearson's Chi-squared test with Yates' continuity correction

 

X-squared = 1.7579, df = 1, p-value = 0.1849

 

 

chisq.test(Matriz,
           correct=FALSE)      # No continuity correction for 2 x 2
                               #      table

 

      Pearson's Chi-squared test

X-squared = 2.0396, df = 1, p-value = 0.1533

 

#     #     #

 

 

Example of chi-square test with matrix created by combining vectors

 

### --------------------------------------------------------------
### Vaccination example, Chi-square independence, pp. 59
60
###      Example creating a matrix from vectors
### --------------------------------------------------------------

R1 = c(4788, 30)
R2 = c(8916, 76)

rows   = 2

Matriz = matrix(c(R1, R2),
                nrow=rows,
                byrow=TRUE)

rownames(Matriz) = c("Thigh", "Arm")          # Naming the rows and
colnames(Matriz) = c("No.severe", "Severe")   #  columns is optional.

Matriz

 

      No.severe Severe

Thigh      4788     30

Arm        8916     76

       

 

chisq.test(Matriz,
           correct=TRUE)      # Continuity correction for 2 x 2
                              #      table

 

Pearson's Chi-squared test with Yates' continuity correction

X-squared = 1.7579, df = 1, p-value = 0.1849

 

 

chisq.test(Matriz,
           correct=FALSE)      # No continuity correction for 2 x 2
                               #      table

 

Pearson's Chi-squared test

X-squared = 2.0396, df = 1, p-value = 0.1533

 

#     #     #

 

 

Null hypothesis

How the test works

See the Handbook for information on these topics.

 

Post-hoc tests

For the following example of post-hoc pairwise testing, we’ll use the pairwiseNominalIndependence function from the package rcompanion to make the task easier.  Then we’ll use pairwise.table in the native stats package as an alternative.

 

Post-hoc pairwise chi-square tests with rcompanion

 

 

### --------------------------------------------------------------
### Post-hoc example, Chi-square independence, pp. 60
61
### --------------------------------------------------------------

Input =("
Supplement     No.cancer  Cancer
'Selenium'     8177       575
'Vitamin E'    8117       620
'Selenium+E'   8147       555
'Placebo'      8167       529
")

Matriz = as.matrix(read.table(textConnection(Input),
                   header=TRUE,
                   row.names=1))

Matriz

chisq.test(Matriz)

 

X-squared = 7.7832, df = 3, p-value = 0.05071

 

 

library(rcompanion)

pairwiseNominalIndependence(Matriz,
                            fisher = FALSE,
                            gtest  = FALSE,
                            chisq  = TRUE,
                            method = "fdr")


              Comparison p.Chisq p.adj.Chisq
1   Selenium : Vitamin E 0.17700      0.2960
2  Selenium : Selenium+E 0.62800      0.6280
3     Selenium : Placebo 0.19700      0.2960
4 Vitamin E : Selenium+E 0.06260      0.1880
5    Vitamin E : Placebo 0.00771      0.0463
6   Selenium+E : Placebo 0.44000      0.5280

 

 

Post-hoc pairwise chi-square tests with pairwise.table

 

### --------------------------------------------------------------
### Post-hoc example, Chi-square independence, pp. 60
61
### As is, this code works on a matrix with two columns,
###   and compares rows
### --------------------------------------------------------------

Input =("
Supplement     No.cancer  Cancer
'Selenium'     8177       575
'Vitamin E'    8117       620
'Selenium+E'   8147       555
'Placebo'      8167       529
")

Matriz = as.matrix(read.table(textConnection(Input),
                   header=TRUE,
                   row.names=1))

Matriz

chisq.test(Matriz)

 

X-squared = 7.7832, df = 3, p-value = 0.05071

 

 

FUN = function(i,j){    
      chisq.test(matrix(c(Matriz[i,1], Matriz[i,2],
                          Matriz[j,1], Matriz[j,2]),
                 nrow=2,
                 byrow=TRUE))$ p.value
                }
     
pairwise.table(FUN,
               rownames(Matriz),
               p.adjust.method="none")

                           # Can adjust p-values;
                           # see ?p.adjust for options

 

                Selenium   Vitamin.E Selenium.and.E

Vitamin.E      0.1772113          NA             NA

Selenium.and.E 0.6277621 0.062588260             NA

Placebo        0.1973435 0.007705529      0.4398677

 

#     #     #

 

 

Assumptions

See the Handbook for information on this topic.

 

Examples

Chi-square test of independence with continuity correction and without correction

 

### --------------------------------------------------------------
### Helmet example, Chi-square independence, p. 63
### --------------------------------------------------------------

Input =("
PSE        Head.injury  Other.injury
Helemt     372          4715
No.helmet  267          1391
")

Matriz = as.matrix(read.table(textConnection(Input),
                   header=TRUE,
                   row.names=1))

Matriz
 
chisq.test(Matriz,
           correct=TRUE)       # Continuity correction for 2 x 2
                               #      table

 

Pearson's Chi-squared test with Yates' continuity correction

X-squared = 111.6569, df = 1, p-value < 2.2e-16

 

 

chisq.test(Matriz,
           correct=FALSE)      # No continuity correction for 2 x 2
                               #      table

 

      Pearson's Chi-squared test

      X-squared = 112.6796, df = 1, p-value < 2.2e-16

 

#     #     #

 

 

Chi-square test of independence

 

### --------------------------------------------------------------
### Gardemann apolipoprotein example, Chi-square independence,
###   p. 63
### --------------------------------------------------------------

Input =("
Genotype  No.disease Coronary.disease
'ins/ins'   268        807
'ins/del'   199        759
'del/del'    42        184
")

Matriz = as.matrix(read.table(textConnection(Input),
                   header=TRUE,
                   row.names=1))

Matriz

chisq.test(Matriz)

 

Pearson's Chi-squared test

X-squared = 7.2594, df = 2, p-value = 0.02652

 

#     #     #

 

 

Graphing the results

The first plot below is a bar char with confidence intervals, with a style typical of the ggplot2 package.  The second plot is somewhat more similar to the style of the plot in the Handbook

 

For each example, the code calculates proportions or confidence intervals.  This code could be skipped if those values were determined manually and put in to a data frame from which the plot could be generated.

 

Sometimes factors will need to have the order of their levels specified for ggplot2 to put them in the correct order on the plot.  Otherwise R will alphabetize levels.

 

Simple bar plot with error bars showing confidence intervals

 

### --------------------------------------------------------------
### Plot example, herons and egrets, Chi-square test of association,
###   pp. 63
64
### --------------------------------------------------------------

Input =("
Supplement     No.cancer  Cancer
'Selenium'     8177       575
'Vitamin E'    8117       620
'Selenium+E'   8147       555
'Placebo'      8167       529
")
 
Prostate = read.table(textConnection(Input),header=TRUE)


### Add sums and confidence intervals

library(dplyr)

Prostate =
mutate(Prostate,
       Sum = No.cancer + Cancer)

Prostate =
mutate(Prostate,
       Prop = Cancer / Sum,
       low.ci = apply(Prostate[c("Cancer", "Sum")], 1,
                function(y) binom.test(y['Cancer'], y['Sum'])$ conf.int[1]),
       high.ci = apply(Prostate[c("Cancer", "Sum")], 1,
                function(y) binom.test(y['Cancer'], y['Sum'])$ conf.int[2])
       )

Prostate

 

  Supplement No.cancer Cancer  Sum       Prop     low.ci    high.ci

1   Selenium      8177    575 8752 0.06569927 0.06059677 0.07109314

2  Vitamin E      8117    620 8737 0.07096257 0.06566518 0.07654816

3 Selenium+E      8147    555 8702 0.06377844 0.05873360 0.06911770

4    Placebo      8167    529 8696 0.06083257 0.05589912 0.06606271

 

 

### Plot (Bar chart plot)

library(ggplot2)

ggplot(Prostate,
 aes(x=Supplement, y=Prop)) +
 geom_bar(stat="identity", fill="gray40",
          colour="black", size=0.5,
          width=0.7) +
 geom_errorbar(aes(ymax=high.ci, ymin=low.ci),
                   width=0.2, size=0.5, color="black") +
 xlab("Supplement") +
 ylab("Prostate cancer proportion") +
 scale_x_discrete(labels=c("Selenium", "Vitamin E",
                           "Selenium+E","Placebo")) +
 ## ggtitle("Main title") +
 theme(axis.title=element_text(size=14, color="black",
                               face="bold", vjust=3)) +
 theme(axis.text = element_text(size=12, color = "gray25",
                                face="bold")) +
 theme(axis.title.y = element_text(vjust= 1.8)) +
 theme(axis.title.x = element_text(vjust= -0.5))

 

#     #     #

 

Rplot

 

Bar plot of proportions vs. categories.  Error bars indicate 95% confidence intervals for  observed proportion.

 

 

Bar plot with categories and no error bars

 

### --------------------------------------------------------------
### Plot example, herons and egrets, Chi-square independence,
###   p. 64
### --------------------------------------------------------------

Input =("
Habitat      Bird   Count
Vegetation   Heron   15
Shoreline    Heron   20
Water        Heron   14
Structures   Heron    6
Vegetation   Egret    8
Shoreline    Egret    5
Water        Egret    7
Structures   Egret    1
")
 
Birds = read.table(textConnection(Input),header=TRUE)


### Specify the order of factor levels

library(dplyr)

Birds=
mutate(Birds,
       Habitat = factor(Habitat,levels=unique(Habitat)),
       Bird = factor(Bird,levels=unique(Bird))
       )
      

### Add sums and proportions

Birds$ Sum[Birds$ Bird == 'Heron'] =
       sum(Birds$ Count[Birds$ Bird == 'Heron'])

Birds$ Sum[Birds$ Bird == 'Egret'] =
       sum(Birds$ Count[Birds$ Bird == 'Egret'])

Birds=
mutate(Birds,
       prop = Count / Sum
       )

Birds

 

     Habitat  Bird Count Sum       prop

1 Vegetation Heron    15  55 0.27272727

2  Shoreline Heron    20  55 0.36363636

3      Water Heron    14  55 0.25454545

4 Structures Heron     6  55 0.10909091

5 Vegetation Egret     8  21 0.38095238

6  Shoreline Egret     5  21 0.23809524

7      Water Egret     7  21 0.33333333

8 Structures Egret     1  21 0.04761905

 

 

### Plot adapted from:
### shinyapps.stat.ubc.ca/r-graph-catalog/

library(ggplot2)
library(grid)

ggplot(Birds,
  aes(x = Habitat, y = prop, fill = Bird, ymax=0.40, ymin=0)) +
  geom_bar(stat="identity", position = "dodge", width = 0.7) +
  geom_bar(stat="identity", position = "dodge", colour = "black",
           width = 0.7, show_guide = FALSE) +
  scale_y_continuous(breaks = seq(0, 0.40, 0.05),
                     limits = c(0, 0.40),
                     expand = c(0, 0)) +
  scale_fill_manual(name = "Bird type" ,
                    values = c('grey80', 'grey30'),
                    labels = c("Heron (all types)",
                               "Egret (all types)")) +
  ## geom_errorbar(position=position_dodge(width=0.7),
  ##               width=0.0, size=0.5, color="black") +
  labs(x = "Habitat Location", y = "Landing site proportion") +
  ## ggtitle("Main title") +
  theme_bw() +
  theme(panel.grid.major.x = element_blank(),
        panel.grid.major.y = element_line(colour = "grey50"),
        plot.title = element_text(size = rel(1.5),
                                  face = "bold", vjust = 1.5),
        axis.title = element_text(face = "bold"),
        legend.position = "top",
        legend.title = element_blank(),
        legend.key.size = unit(0.4, "cm"),
        legend.key = element_rect(fill = "black"),
        axis.title.y = element_text(vjust= 1.8),
        axis.title.x = element_text(vjust= -0.5)
       )

 

#     #     #

 

Rplot

 

 

Similar tests

Chi-square vs. G–test

See the Handbook for information on these topics.  Fisher’s exact test, G-test, and McNemar’s test are discussed elsewhere in this book.

 

How to do the test

Chi-square test of independence with data as a data frame

In the following example for the chi-square test of independence, the data is read in as a data frame, not as a matrix as in previous examples.  This allows more flexibility with how data are entered.  For example you could have counts for same genotype and health distributed among several lines, or have a count of 1 for each row, with a separate row for each individual observation.  The xtabs function is used to tabulate the data and convert them to a contingency table.

 

### --------------------------------------------------------------
### Gardemann apolipoprotein example, Chi-square independence,
###      SAS example, pp. 65
66
###      Example using cross-tabulation
### --------------------------------------------------------------

Input =("
Genotype  Health       Count
ins-ins   no_disease   268
ins-ins   disease      807
ins-del   no_disease   199
ins-del   disease      759
del-del   no_disease    42
del-del   disease      184
")

Data.frame = read.table(textConnection(Input),header=TRUE)


###  Cross-tabulate the data

Data.xtabs = xtabs(Count ~ Genotype + Health,
                   data=Data.frame)

Data.xtabs

 

       Health

Genotype  disease no_disease

  del-del     184         42

  ins-del     759        199

  ins-ins     807        268

 

 

summary(Data.xtabs)                # includes N and factors

 

Number of cases in table: 2259

Number of factors: 2

     

 

###  Chi-square test of independence
       
chisq.test(Data.xtabs)

 

X-squared = 7.2594, df = 2, p-value = 0.02652

 

#     #     #

 

 

Power analysis

Power analysis for chi-square test of independence

 

### --------------------------------------------------------------
### Power analysis, chi-square independence, pp. 66
67
### --------------------------------------------------------------

# This example assumes you are using a Chi-square test of
#   independence.  The example in the Handbook appears to use
#   a Chi-square goodness-of-fit test

# In the pwr package, for the Chi-square test of independence,
#   the table probabilities should sum to 1

Input =("
Genotype  No.cancer Cancer
GG        0.18      0.165
GA        0.24      0.225
AA        0.08      0.110
")

P = as.matrix(read.table(textConnection(Input),
              header=TRUE,
              row.names=1))

P

   No.cancer Cancer

GG      0.18  0.165

GA      0.24  0.225

AA      0.08  0.110

 

 

sum(P)        # Sum of values in the P matrix

 

[1] 1

 

 

library(pwr)

effect.size = ES.w2(P) 

degrees = (nrow(P)-1)*(ncol(P)-1)  # Calculate degrees of freedom

pwr.chisq.test(
       w=effect.size,
       N=NULL,            # Total number of observations
       df=degrees,
       power=0.80,        # 1 minus Type II probability
       sig.level=0.05)    # Type I probability
      

                        

        w = 0.07663476  # Answer differs significantly

        N = 1640.537    #   from Handbook

       df = 2           # Total observations

sig.level = 0.05

    power = 0.8

 

#     #     #