
Summary and Analysis of Extension Program Evaluation in R

Salvatore S. Mangiafico

Kruskal–Wallis Test

The Kruskal–Wallis test is a rank-based test that is similar to the Mann–Whitney U test, but can be applied to one-way data with more than two groups.  A significant result suggests that values for the groups are not all the same.  The test is useful to compare the scores or ratings from multiple speakers, presentations, or groups of audiences.

 

If the shape and spread of the distributions of values of each group are similar, then the test compares the medians of the groups.  Otherwise, the test is really testing if the distributions of values of the groups differ.  Outliers affect the spread of the data; so, if there are outliers, the test doesn’t reliably test for the median.

 

The test is performed with the kruskal.test function.

 

If the distributions of values of each group are similar in shape, but have outliers, then Mood’s median test is an appropriate alternative.  Mood’s median test is described in the next chapter.

 

Post-hoc tests

The outcome of the Kruskal–Wallis test tells you if there are differences among the groups, but doesn’t tell you which groups are different from other groups.  In order to determine which groups are different from others, post-hoc testing can be conducted.  Probably the most common post-hoc test for the Kruskal–Wallis test is the Dunn test, here conducted with the dunnTest function in the FSA package.  An alternative to this is to conduct Mann–Whitney tests on each pair of groups.  This is accomplished with the pairwise.wilcox.test function.

 

Appropriate data

•  One-way data

•  Dependent variable is ordinal, interval, or ratio

•  Independent variable is a factor with two or more levels.  That is, two or more groups

•  Observations between groups are independent.  That is, not paired or repeated measures data

•  In order to be a test of medians, the distributions of values for each group need to be of similar shape and spread.  Otherwise the test is a test of distributions.

 

Hypotheses

If the distributions of the groups are similar in shape and spread:

•  Null hypothesis:  The medians of values for each group are equal.

•  Alternative hypothesis (two-sided): The medians of values for each group are not equal.

 

If the distributions of the groups are not similar in shape and spread:

•  Null hypothesis:  The distributions of values for each group are equal.

•  Alternative hypothesis (two-sided): The distributions of values for each group are not equal.

 

Interpretation

If the distributions of the groups are similar in shape and spread:

Significant results can be reported as “There was a significant difference in median values across groups.”

Post-hoc analysis allows you to say “The median for group A was higher than the median for group B”, and so on.

 

If the distributions of the groups are not similar in shape and spread:

Significant results can be reported as “There was a significant difference in the distributions of values among groups.”  Or, as “There was a significant difference in values among groups.”

 

Other notes and alternative tests

Mood’s median test compares the medians of groups.  It is described in the next chapter.

 

Another alternative is to use cumulative link models for ordinal data, which are described later in this book.

 

Packages used in this chapter

 

The packages used in this chapter include:

•  psych

•  FSA

•  lattice

•  multcompView

•  rcompanion

•  ggplot2

 

The following commands will install these packages if they are not already installed:


if(!require(psych)){install.packages("psych")}
if(!require(FSA)){install.packages("FSA")}
if(!require(lattice)){install.packages("lattice")}
if(!require(multcompView)){install.packages("multcompView")}
if(!require(rcompanion)){install.packages("rcompanion")}
if(!require(ggplot2)){install.packages("ggplot2")}


Kruskal–Wallis test example

 

This example re-visits the Pooh, Piglet, and Tigger data from the Descriptive Statistics with the likert Package chapter.

 

It answers the question, “Are the scores significantly different among the three speakers?”

 

The Kruskal–Wallis test is conducted with the kruskal.test function, which produces a p-value for the hypothesis.  First the data are summarized and examined using histograms for each group.

 

Note that because the histograms show that the distributions of scores for each of the speakers are relatively similar in shape and spread, the Kruskal–Wallis test can be interpreted as a test of medians.


Input =("
 Speaker  Likert
 Pooh      3
 Pooh      5
 Pooh      4
 Pooh      4
 Pooh      4
 Pooh      4
 Pooh      4
 Pooh      4
 Pooh      5
 Pooh      5
 Piglet    2
 Piglet    4
 Piglet    2
 Piglet    2
 Piglet    1
 Piglet    2
 Piglet    3
 Piglet    2
 Piglet    2
 Piglet    3
 Tigger    4
 Tigger    4
 Tigger    4
 Tigger    4
 Tigger    5
 Tigger    3
 Tigger    5
 Tigger    4
 Tigger    4
 Tigger    3
")

Data = read.table(textConnection(Input),header=TRUE)

### Order levels of the factor; otherwise R will alphabetize them

Data$Speaker = factor(Data$Speaker,
                      levels=unique(Data$Speaker))

### Create a new variable which is the likert scores as an ordered factor

Data$Likert.f = factor(Data$Likert,
                       ordered = TRUE)


###  Check the data frame

library(psych)

headTail(Data)

str(Data)

summary(Data)


### Remove unnecessary objects

rm(Input)


Summarize data treating Likert scores as factors


xtabs( ~ Speaker + Likert.f,
       data = Data)


        Likert.f
Speaker  1 2 3 4 5
  Pooh   0 0 1 6 3
  Piglet 1 6 2 1 0
  Tigger 0 0 2 6 2


XT = xtabs( ~ Speaker + Likert.f,
            data = Data)


prop.table(XT,
           margin = 1)


        Likert.f
Speaker    1   2   3   4   5
  Pooh   0.0 0.0 0.1 0.6 0.3
  Piglet 0.1 0.6 0.2 0.1 0.0
  Tigger 0.0 0.0 0.2 0.6 0.2


Histograms by group


library(lattice)

histogram(~ Likert.f | Speaker,
          data=Data,
          layout=c(1,3)      #  columns and rows of individual plots
          )



Summarize data treating Likert scores as numeric


library(FSA)

Summarize(Likert ~ Speaker,
          data=Data,
          digits=3)


  Speaker  n mean    sd min Q1 median   Q3 max percZero
1    Pooh 10  4.2 0.632   3  4      4 4.75   5        0
2  Piglet 10  2.3 0.823   1  2      2 2.75   4        0
3  Tigger 10  4.0 0.667   3  4      4 4.00   5        0
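
As a rough numeric companion to the histograms, the medians and interquartile ranges can also be computed in base R.  This is only a sketch: the list name scores is hypothetical, it holds the same Likert scores as the data frame above, and the interquartile range is just one informal way to compare spread among groups.

```r
# Scores copied from the Input table in this chapter
scores = list(
  Pooh   = c(3, 5, 4, 4, 4, 4, 4, 4, 5, 5),
  Piglet = c(2, 4, 2, 2, 1, 2, 3, 2, 2, 3),
  Tigger = c(4, 4, 4, 4, 5, 3, 5, 4, 4, 3)
)

medians = sapply(scores, median)   # median score for each speaker
spreads = sapply(scores, IQR)      # Q3 minus Q1 for each speaker

medians
spreads
```

The medians and quartiles agree with the Summarize output above; the interquartile ranges are 0.75 for Pooh and Piglet and 0 for Tigger.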


Kruskal–Wallis test example

This example uses the formula notation indicating that Likert is the dependent variable and Speaker is the independent variable.  The data= option indicates the data frame that contains the variables.  For the meaning of other options, see ?kruskal.test.


kruskal.test(Likert ~ Speaker,
             data = Data)


Kruskal-Wallis rank sum test

Kruskal-Wallis chi-squared = 16.842, df = 2, p-value = 0.0002202
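
To see where the statistic and p-value above come from, the calculation can be sketched by hand in base R.  The variable names here (r, Rj, nj, ties, H) are chosen for illustration only; the formula is the usual rank-sum form of the Kruskal–Wallis statistic with a correction for ties, and the p-value uses the chi-squared approximation.

```r
# Scores copied from the Input table in this chapter
Likert  = c(3, 5, 4, 4, 4, 4, 4, 4, 5, 5,    # Pooh
            2, 4, 2, 2, 1, 2, 3, 2, 2, 3,    # Piglet
            4, 4, 4, 4, 5, 3, 5, 4, 4, 3)    # Tigger
Speaker = factor(rep(c("Pooh", "Piglet", "Tigger"), each = 10))

N  = length(Likert)
r  = rank(Likert)                    # tied scores share their average rank
Rj = tapply(r, Speaker, sum)         # rank sum for each speaker
nj = tapply(r, Speaker, length)      # number of observations per speaker

# Statistic before the tie correction
H = 12 / (N * (N + 1)) * sum(Rj^2 / nj) - 3 * (N + 1)

# Correction for ties: divide by 1 - sum(t^3 - t) / (N^3 - N)
ties = table(Likert)
H    = H / (1 - sum(ties^3 - ties) / (N^3 - N))

# p-value from the chi-squared distribution with (groups - 1) df
p = pchisq(H, df = nlevels(Speaker) - 1, lower.tail = FALSE)

round(H, 3)     # 16.842
signif(p, 4)    # 0.0002202
```

Because Likert data contain many ties, the tie correction matters here: without it the statistic would be about 15.2 rather than 16.8.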


Post-hoc test: Dunn test for multiple comparisons of groups

If the Kruskal–Wallis test is significant, a post-hoc analysis can be performed to determine which groups differ from which other groups.

 

Probably the most popular post-hoc test for the Kruskal–Wallis test is the Dunn test.  The Dunn test can be conducted with the dunnTest function in the FSA package.

 

Because the post-hoc test will produce multiple p-values, adjustments to the p-values can be made to avoid inflating the possibility of making a type-I error.  There are a variety of methods for controlling the familywise error rate or for controlling the false discovery rate.  See ?p.adjust for details on these methods.


### Order groups by median

Data$Speaker = factor(Data$Speaker,
                      levels=c("Pooh", "Tigger", "Piglet"))

Data


### Dunn test

library(FSA)

DT = dunnTest(Likert ~ Speaker,
              data=Data,
              method="bh")      # Adjusts p-values for multiple comparisons;
                                # See ?dunnTest for options

DT


Dunn (1964) Kruskal-Wallis multiple comparison
  p-values adjusted with the Benjamini-Hochberg method.

       Comparison         Z      P.unadj        P.adj
1   Pooh - Tigger 0.4813074 0.6302980448 0.6302980448
2   Pooh - Piglet 3.7702412 0.0001630898 0.0004892695
3 Tigger - Piglet 3.2889338 0.0010056766 0.0015085149
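
The P.adj column above can be reproduced from the P.unadj column with the p.adjust function.  The following is a small sketch with hypothetical object names (p.unadj, p.bh, manual).  Note that p.adjust spells the method "BH", whereas the dunnTest call above uses "bh".

```r
# Unadjusted p-values copied from the dunnTest output above
p.unadj = c("Pooh - Tigger"   = 0.6302980448,
            "Pooh - Piglet"   = 0.0001630898,
            "Tigger - Piglet" = 0.0010056766)

# Benjamini-Hochberg adjustment
p.bh = p.adjust(p.unadj, method = "BH")

# The same adjustment by hand: multiply each p-value by m / rank,
# then make sure adjusted values never increase as rank decreases
m      = length(p.unadj)
o      = order(p.unadj, decreasing = TRUE)
manual = pmin(1, cummin(m / (m:1) * p.unadj[o]))[order(o)]

p.bh
```

With three comparisons, the smallest p-value is multiplied by 3/1, the middle one by 3/2, and the largest by 3/3, which is why the largest p-value is unchanged in the output.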


### Compact letter display


PT = DT$res

PT

library(rcompanion)

cldList(comparison = PT$Comparison,
        p.value    = PT$P.adj,
        threshold  = 0.05)


   Group Letter MonoLetter
1   Pooh      a         a
2 Tigger      a         a
3 Piglet      b          b

Groups sharing a letter are not significantly different (alpha = 0.05).


Post-hoc test: pairwise Mann–Whitney U-tests for multiple comparisons

Another approach to post-hoc testing for the Kruskal–Wallis test is to use Mann–Whitney U-tests for each pair of groups. 

This can be conducted with the pairwise.wilcox.test function.  This produces a table of p-values comparing each pair of groups.

To prevent the inflation of type I error rates, adjustments to the p-values can be made using the p.adjust.method option.  Here the fdr method is used.  See ?p.adjust for details on available p-value adjustment methods.

When there are many p-values to evaluate, it is useful to condense a table of p-values to a compact letter display format.  This can be accomplished with a combination of my fullPTable function and the multcompLetters function in the multcompView package.

Note that the p-value results of the pairwise Mann–Whitney U-tests differ somewhat from those of the Dunn test.

A compact letter display condenses a table of p-values into a simpler format.  In the output, groups are separated by letters.  Groups sharing the same letter are not significantly different.  Compact letter displays are a clear and succinct way to present results of multiple comparisons.



The code creates a matrix of p-values called PT, then converts this to a fuller matrix called PT1.  PT1 is then passed to the multcompLetters function to be converted to a compact letter display.


### Order groups by median

Data$Speaker = factor(Data$Speaker,
                      levels=c("Pooh", "Tigger", "Piglet"))

Data


### Pairwise Mann–Whitney

PT = pairwise.wilcox.test(Data$Likert,
                          Data$Speaker,
                          p.adjust.method="fdr")
                           # Adjusts p-values for multiple comparisons;
                           # See ?p.adjust for options

PT


Pairwise comparisons using Wilcoxon rank sum test

       Pooh   Tigger
Tigger 0.5174 -    
Piglet 0.0012 0.0012

P value adjustment method: fdr 

   ### Note that the values in the table are p-values comparing each
   ###   pair of groups.
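
For readers who want to see what the pairwise approach does, the following sketch runs a separate wilcox.test on each pair of speakers and then adjusts the three p-values together.  The names scores, pairs, p.raw, and p.fdr are hypothetical.  The option exact = FALSE is an assumption made here to request the normal approximation directly, which avoids the warning R gives when it cannot compute exact p-values for tied data.

```r
# Scores copied from the Input table in this chapter
scores = list(
  Pooh   = c(3, 5, 4, 4, 4, 4, 4, 4, 5, 5),
  Tigger = c(4, 4, 4, 4, 5, 3, 5, 4, 4, 3),
  Piglet = c(2, 4, 2, 2, 1, 2, 3, 2, 2, 3)
)

# All pairs of speakers, one pair per column
pairs = combn(names(scores), 2)

# One two-sample Mann-Whitney (Wilcoxon rank sum) test per pair
p.raw = apply(pairs, 2, function(g) {
  wilcox.test(scores[[g[1]]], scores[[g[2]]], exact = FALSE)$p.value
})
names(p.raw) = apply(pairs, 2, paste, collapse = " - ")

# Adjust the three p-values together with the fdr method
p.fdr = p.adjust(p.raw, method = "fdr")

round(p.fdr, 4)
```

The adjusted values should agree with the table above: about 0.5174 for Pooh versus Tigger and about 0.0012 for each comparison with Piglet.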


### Convert PT to a full table and call it PT1

library(rcompanion)

PT1 = fullPTable(PT)

PT1


              Pooh      Tigger      Piglet
Pooh   1.000000000 0.517377650 0.001241095
Tigger 0.517377650 1.000000000 0.001241095
Piglet 0.001241095 0.001241095 1.000000000
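
The conversion to a full table can be pictured with a few lines of base R.  This is an illustrative sketch only, not the code inside fullPTable; the object names lower and full are hypothetical, and the p-values are typed in from the output above.

```r
groups = c("Pooh", "Tigger", "Piglet")

# Lower-triangular p-values, laid out as pairwise.wilcox.test reports them
lower = matrix(c(0.517377650, 0.001241095,
                 NA,          0.001241095),
               nrow = 2,
               dimnames = list(groups[-1], groups[-3]))

# Start from a matrix of 1s (each group compared with itself),
# then copy every available p-value into both triangles
full = matrix(1, 3, 3, dimnames = list(groups, groups))

for (i in rownames(lower)) {
  for (j in colnames(lower)) {
    if (!is.na(lower[i, j])) {
      full[i, j] = lower[i, j]
      full[j, i] = lower[i, j]
    }
  }
}

full
```

The result is a symmetric matrix with one row and one column per group, which is the form multcompLetters expects.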


### Produce compact letter display

library(multcompView)

multcompLetters(PT1,
                compare="<",
                threshold=0.05,  # p-value to use as significance threshold
                Letters=letters,
                reversed = FALSE)


  Pooh Tigger Piglet
   "a"    "a"    "b"

   ### Values sharing a letter are not significantly different

Plot of medians and confidence intervals

 

The following code uses the groupwiseMedian function to produce a data frame of medians for each speaker along with the 95% confidence intervals for each median with the percentile method.

 

These medians are then plotted, with their confidence intervals shown as error bars.  The grouping letters from the multiple comparisons (Dunn test or pairwise Mann–Whitney U-tests) are added.


library(rcompanion)

Sum = groupwiseMedian(data=Data,
                      group="Speaker",
                      var="Likert",
                      conf=0.95,
                      R=5000,
                      percentile=TRUE,
                      bca=FALSE,
                      digits=3)

Sum


  Speaker  n Median Conf.level Percentile.lower Percentile.upper
1    Pooh 10      4       0.95              4.0              5.0
2  Piglet 10      2       0.95              2.0              3.0
3  Tigger 10      4       0.95              3.5              4.5
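
The idea behind a percentile confidence interval for a median can be sketched with a simple resampling loop in base R.  This is not the code used by groupwiseMedian; the seed and the object names pooh, boot.medians, and ci are arbitrary choices for illustration, and results from any bootstrap will vary slightly from run to run.

```r
set.seed(1)   # arbitrary seed, used only so the sketch is repeatable

# Pooh's scores copied from the Input table in this chapter
pooh = c(3, 5, 4, 4, 4, 4, 4, 4, 5, 5)

# Resample the scores with replacement and record the median each time
R = 5000
boot.medians = replicate(R, median(sample(pooh, replace = TRUE)))

# Percentile interval: the middle 95% of the bootstrap medians
ci = quantile(boot.medians, probs = c(0.025, 0.975))

median(pooh)
ci
```

With discrete data such as Likert scores, the bootstrap medians can take only a few distinct values, so the interval limits tend to fall on whole or half scores, as in the table above.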


X = 1:3
Y = Sum$Percentile.upper+0.2
Label = c("a", "b", "a")


library(ggplot2)

ggplot(Sum,                ### The data frame to use.
       aes(x = Speaker,
           y = Median)) +
   geom_errorbar(aes(ymin = Percentile.lower,
                     ymax = Percentile.upper),
                     width = 0.05,
                     size  = 0.5) +   # ggplot2 >= 3.4.0 prefers linewidth for lines
   geom_point(shape = 15,
              size  = 4) +
   theme_bw() +
   theme(axis.title   = element_text(face  = "bold")) +
   ylab("Median Likert score") +
   annotate("text",
            x = X,
            y = Y,
            label = Label)



Plot of median Likert versus Speaker.  Error bars indicate the 95% confidence intervals for the median with the percentile method.



Exercises L


1. Considering Pooh, Piglet, and Tigger’s data,

What was the median score for each instructor?

Are the data for all instructors reasonably similar in shape and spread?

Based on your previous answer, what is the null hypothesis for the Kruskal–Wallis test?

According to the Kruskal–Wallis test, is there a statistical difference in scores among the instructors?

Looking at the post-hoc analysis, which speakers’ scores are statistically different from which others?  Who had the statistically highest scores?



2. Brian, Stewie, and Meg want to assess the education level of students in their courses on creative writing for adults.  They want to know the median education level for each class, and whether the education levels of the classes differed among instructors.

 

They used the following table to code their data.

 

Code   Abbreviation   Level

1      < HS           Less than high school
2        HS           High school
3        BA           Bachelor’s
4        MA           Master’s
5        PhD          Doctorate

 

The following are the course data.

 

Instructor        Student  Education
'Brian Griffin'   a        3
'Brian Griffin'   b        2
'Brian Griffin'   c        3
'Brian Griffin'   d        3
'Brian Griffin'   e        3
'Brian Griffin'   f        3
'Brian Griffin'   g        4
'Brian Griffin'   h        5
'Brian Griffin'   i        3
'Brian Griffin'   j        4
'Brian Griffin'   k        3
'Brian Griffin'   l        2
'Stewie Griffin'  m        4
'Stewie Griffin'  n        5
'Stewie Griffin'  o        4
'Stewie Griffin'  p        4
'Stewie Griffin'  q        4
'Stewie Griffin'  r        4
'Stewie Griffin'  s        3
'Stewie Griffin'  t        5
'Stewie Griffin'  u        4
'Stewie Griffin'  v        4
'Stewie Griffin'  w        3
'Stewie Griffin'  x        2
'Meg Griffin'     y        3
'Meg Griffin'     z        4
'Meg Griffin'     aa       3
'Meg Griffin'     ab       3
'Meg Griffin'     ac       3
'Meg Griffin'     ad       2
'Meg Griffin'     ae       3
'Meg Griffin'     af       4
'Meg Griffin'     ag       2
'Meg Griffin'     ah       3
'Meg Griffin'     ai       2
'Meg Griffin'     aj       1
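
One way to read a table like this into R is sketched below.  Only six rows are shown to keep the example short; the full table is read the same way.  The variable name Education.f is a hypothetical choice, and the labels are taken from the coding table above.

```r
# A six-row excerpt of the course data
Input = ("
Instructor        Student  Education
'Brian Griffin'   a        3
'Brian Griffin'   b        2
'Stewie Griffin'  m        4
'Stewie Griffin'  n        5
'Meg Griffin'     y        3
'Meg Griffin'     z        4
")

# The single quotes keep each two-word name together as one field
Data = read.table(textConnection(Input), header = TRUE)

# Keep instructors in the order they appear rather than alphabetical
Data$Instructor = factor(Data$Instructor,
                         levels = unique(Data$Instructor))

# Attach the abbreviations from the coding table to the numeric codes
Data$Education.f = factor(Data$Education,
                          levels  = 1:5,
                          labels  = c("< HS", "HS", "BA", "MA", "PhD"),
                          ordered = TRUE)

str(Data)
```

Keeping both the numeric Education variable and the labeled Education.f variable makes it easier to run rank-based tests on the codes while reporting results by education level.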


For each of the following, answer the question, and show the output from the analyses you used to answer the question.

 

What was the median education level for each instructor’s class?  (Be sure to report the education level, not just the numeric code!)

 

Are the distributions of education levels for all instructors reasonably similar in shape and spread?

Based on your previous answer, what is the null hypothesis for the Kruskal–Wallis test?

According to the Kruskal–Wallis test, is there a difference in the education level of students among the instructors?

 

Looking at the post-hoc analysis, which classes’ education levels are different from which others?  Who had the statistically highest education level?

 

Plot Brian, Stewie, and Meg’s data in a way that helps you visualize the data.  Do the results reflect what you would expect from looking at the plot?  How would you summarize the results of the descriptive statistics and tests?