The Kruskal–Wallis test is a rank-based test that is similar to the Mann–Whitney U test, but can be applied to one-way data with more than two groups. A significant result suggests that values for the groups are not all the same. The test is useful to compare the scores or ratings from multiple speakers, presentations, or groups of audiences.

If the shape and spread of the distributions of values for each group are similar, then the test compares the medians of the groups. Otherwise, the test is really testing whether the distributions of values of the groups differ. Outliers affect the spread of the data; so, if there are outliers, the test doesn’t reliably test for the median.

The test is performed with the *kruskal.test* function.

If the distributions of values of each group are similar in shape, but have outliers, then Mood’s median test is an appropriate alternative. Mood’s median test is described in the next chapter.

##### Post-hoc tests

The outcome of the Kruskal–Wallis test tells you if there
are differences among the groups, but doesn’t tell you *which* groups are
different from other groups. In order to determine which groups are different
from others, post-hoc testing can be conducted. Probably the most common
post-hoc test for the Kruskal–Wallis test is the Dunn test, here conducted with
the *dunnTest* function in the *FSA* package. An alternative to this
is to conduct Mann–Whitney tests on each pair of groups. This is accomplished
with the *pairwise.wilcox.test* function.

##### Appropriate data

• One-way data

• Dependent variable is ordinal, interval, or ratio

• Independent variable is a factor with two or more levels. That is, two or more groups

• Observations between groups are independent. That is, not paired or repeated measures data

• In order to be a test of medians, the distributions of values for each group need to be of similar shape and spread. Otherwise the test is a test of distributions.

##### Hypotheses

*If the distributions of the groups are similar in
shape and spread:*

• Null hypothesis: The medians of values for each group are equal.

• Alternative hypothesis (two-sided): The medians of values for each group are not all equal.

*If the distributions of the groups are not similar in
shape and spread:*

• Null hypothesis: The distributions of values for each group are equal.

• Alternative hypothesis (two-sided): The distributions of values for each group are not all equal.

##### Interpretation

*If the distributions of the groups are similar in
shape and spread:*

Significant results can be reported as “There was a significant difference in median values across groups.”

Post-hoc analysis allows you to say “The median for group A was higher than the median for group B”, and so on.

*If the distributions of the groups are not similar in
shape and spread:*

Significant results can be reported as “There was a significant difference in the distributions of values among groups,” or, more simply, as “There was a significant difference in values among groups.”

##### Other notes and alternative tests

Mood’s median test compares the medians of groups. It is described in the next chapter.

Another alternative is to use cumulative link models for ordinal data, which are described later in this book.

### Packages used in this chapter

The packages used in this chapter include:

• psych

• FSA

• lattice

• multcompView

• rcompanion

• ggplot2

The following commands will install these packages if they are not already installed:

if(!require(psych)){install.packages("psych")}

if(!require(FSA)){install.packages("FSA")}

if(!require(lattice)){install.packages("lattice")}

if(!require(multcompView)){install.packages("multcompView")}

if(!require(rcompanion)){install.packages("rcompanion")}

if(!require(ggplot2)){install.packages("ggplot2")}

### Kruskal–Wallis test example

This example re-visits the Pooh, Piglet, and Tigger data
from the *Descriptive Statistics with the likert Package* chapter.

It answers the question, “Are the scores significantly different among the three speakers?”

The Kruskal–Wallis test is conducted with the *kruskal.test*
function, which produces a *p*-value for the hypothesis. First the data
are summarized and examined using histograms for each group.

Note that because the histogram shows that the distributions of scores for each of the speakers are relatively similar in shape and spread, the Kruskal–Wallis test can be interpreted as a test of medians.

Input =("

Speaker Likert

Pooh 3

Pooh 5

Pooh 4

Pooh 4

Pooh 4

Pooh 4

Pooh 4

Pooh 4

Pooh 5

Pooh 5

Piglet 2

Piglet 4

Piglet 2

Piglet 2

Piglet 1

Piglet 2

Piglet 3

Piglet 2

Piglet 2

Piglet 3

Tigger 4

Tigger 4

Tigger 4

Tigger 4

Tigger 5

Tigger 3

Tigger 5

Tigger 4

Tigger 4

Tigger 3

")

Data = read.table(textConnection(Input),header=TRUE)

### Order levels of the factor; otherwise R will alphabetize them

Data$Speaker = factor(Data$Speaker,

levels=unique(Data$Speaker))

### Create a new variable which is the Likert scores as an ordered factor

Data$Likert.f = factor(Data$Likert,

ordered = TRUE)

### Check the data frame

library(psych)

headTail(Data)

str(Data)

summary(Data)

### Remove unnecessary objects

rm(Input)

#### Summarize data treating Likert scores as factors

xtabs( ~ Speaker + Likert.f,

data = Data)

Likert.f

Speaker 1 2 3 4 5

Pooh 0 0 1 6 3

Piglet 1 6 2 1 0

Tigger 0 0 2 6 2

XT = xtabs( ~ Speaker + Likert.f,

data = Data)

prop.table(XT,

margin = 1)

Likert.f

Speaker 1 2 3 4 5

Pooh 0.0 0.0 0.1 0.6 0.3

Piglet 0.1 0.6 0.2 0.1 0.0

Tigger 0.0 0.0 0.2 0.6 0.2

#### Histograms by group

library(lattice)

histogram(~ Likert.f | Speaker,

data=Data,

layout=c(1,3) # columns and rows of individual plots

)

#### Summarize data treating Likert scores as numeric

library(FSA)

Summarize(Likert ~ Speaker,

data=Data,

digits=3)

Speaker n mean sd min Q1 median Q3 max percZero

1 Pooh 10 4.2 0.632 3 4 4 4.75 5 0

2 Piglet 10 2.3 0.823 1 2 2 2.75 4 0

3 Tigger 10 4.0 0.667 3 4 4 4.00 5 0

#### Kruskal–Wallis test example

This example uses the formula notation indicating that *Likert*
is the dependent variable and *Speaker* is the independent variable. The *data=*
option indicates the data frame that contains the variables. For the meaning
of other options, see *?kruskal.test*.

kruskal.test(Likert ~ Speaker,

data = Data)

Kruskal-Wallis rank sum test

Kruskal-Wallis chi-squared = 16.842, df = 2, p-value = 0.0002202
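To show where the chi-squared value comes from, the statistic can be recomputed by hand from the pooled ranks. The following is a minimal sketch in base R, assuming the usual Kruskal–Wallis H statistic with a correction for ties; the object names (`rank_sum`, `n_i`, `H_raw`, `C`) are chosen here for illustration and do not come from *kruskal.test* itself.

```r
### Scores re-entered from the example data above
Likert  <- c(3,5,4,4,4,4,4,4,5,5,    # Pooh
             2,4,2,2,1,2,3,2,2,3,    # Piglet
             4,4,4,4,5,3,5,4,4,3)    # Tigger

Speaker <- factor(rep(c("Pooh", "Piglet", "Tigger"), each = 10),
                  levels = c("Pooh", "Piglet", "Tigger"))

N        <- length(Likert)
R        <- rank(Likert)                          # tied scores share an average rank
rank_sum <- sapply(split(R, Speaker), sum)        # sum of ranks for each speaker
n_i      <- sapply(split(R, Speaker), length)     # number of observations per speaker

### H statistic before the tie correction
H_raw <- 12 / (N * (N + 1)) * sum(rank_sum^2 / n_i) - 3 * (N + 1)

### Tie correction: t_j is the number of observations at each distinct score
t_j <- as.numeric(table(Likert))
C   <- 1 - sum(t_j^3 - t_j) / (N^3 - N)
H   <- H_raw / C

### p-value from the chi-squared distribution with (groups - 1) degrees of freedom
p_value <- pchisq(H, df = nlevels(Speaker) - 1, lower.tail = FALSE)

round(c(H = H, p = p_value), 7)
```

With Likert data there are many tied scores, so the tie correction matters: here it raises the statistic from about 15.2 to the 16.842 reported above.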

#### Post-hoc test: Dunn test for multiple comparisons of groups

If the Kruskal–Wallis test is significant, a post-hoc analysis can be performed to determine which groups differ from each other.

Probably the most popular post-hoc test for the
Kruskal–Wallis test is the Dunn test. The Dunn test can be conducted with the *dunnTest*
function in the *FSA* package.

Because the post-hoc test will produce multiple *p*-values,
adjustments to the *p*-values can be made to avoid inflating the
possibility of making a type-I error. There are a variety of methods for controlling
the familywise error rate or for controlling the false discovery rate. See *?p.adjust*
for details on these methods.

### Order groups by median

Data$Speaker = factor(Data$Speaker,

levels=c("Pooh", "Tigger", "Piglet"))

Data

### Dunn test

library(FSA)

DT = dunnTest(Likert ~ Speaker,

data=Data,

method="bh") # Adjusts p-values for multiple comparisons;

# See ?dunnTest for options

DT

Dunn (1964) Kruskal-Wallis multiple comparison

p-values adjusted with the Benjamini-Hochberg method.

Comparison Z P.unadj P.adj

1 Pooh - Tigger 0.4813074 0.6302980448 0.6302980448

2 Pooh - Piglet 3.7702412 0.0001630898 0.0004892695

3 Tigger - Piglet 3.2889338 0.0010056766 0.0015085149

### Compact letter display

PT = DT$res

PT

library(rcompanion)

cldList(comparison = PT$Comparison,

p.value = PT$P.adj,

threshold = 0.05)

Group Letter MonoLetter

1 Pooh a a

2 Tigger a a

3 Piglet b b

Groups sharing a letter are not significantly different
(alpha = 0.05).
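The Z statistics and adjusted *p*-values in the Dunn test output above can be reproduced in base R. This is a sketch under the assumption that the test uses the tie-corrected standard error for the difference in mean ranks and two-sided normal *p*-values; it is not the internal code of *dunnTest*, and the object names are chosen for illustration.

```r
### Scores re-entered from the example data above
Likert  <- c(3,5,4,4,4,4,4,4,5,5,    # Pooh
             2,4,2,2,1,2,3,2,2,3,    # Piglet
             4,4,4,4,5,3,5,4,4,3)    # Tigger

Speaker <- factor(rep(c("Pooh", "Piglet", "Tigger"), each = 10),
                  levels = c("Pooh", "Tigger", "Piglet"))

N         <- length(Likert)
R         <- rank(Likert)                       # ranks over all observations
mean_rank <- sapply(split(R, Speaker), mean)    # mean rank per speaker
n_i       <- sapply(split(R, Speaker), length)  # observations per speaker

### Tie term: t_j is the number of observations at each distinct score
t_j      <- as.numeric(table(Likert))
tie_term <- sum(t_j^3 - t_j) / (12 * (N - 1))

### Each column of pairs is one comparison
pairs <- combn(levels(Speaker), 2)

Z <- apply(pairs, 2, function(p) {
  se <- sqrt((N * (N + 1) / 12 - tie_term) *
             (1 / n_i[[p[1]]] + 1 / n_i[[p[2]]]))
  (mean_rank[[p[1]]] - mean_rank[[p[2]]]) / se
})

P.unadj <- 2 * pnorm(-abs(Z))                  # two-sided p-values
P.adj   <- p.adjust(P.unadj, method = "BH")    # Benjamini-Hochberg adjustment

data.frame(Comparison = paste(pairs[1, ], "-", pairs[2, ]),
           Z, P.unadj, P.adj)
```

Changing the `method` argument of *p.adjust* (for example to `"holm"` or `"bonferroni"`) shows how the choice of adjustment affects the adjusted *p*-values while the unadjusted values stay the same.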

#### Post-hoc test: pairwise Mann–Whitney U-tests for multiple comparisons

Another approach to post-hoc testing for the Kruskal–Wallis
test is to use Mann–Whitney U-tests for each pair of groups.

This can be conducted with the *pairwise.wilcox.test* function. This
produces a table of *p*-values comparing each pair of groups.

To prevent the inflation of type I error rates, adjustments to the *p*-values
can be made using the *p.adjust.method* option. Here the *fdr*
method is used. See *?p.adjust* for details on available *p*-value
adjustment methods.

When there are many *p*-values to evaluate, it is useful to condense a
table of *p*-values to a compact letter display format. This can be
accomplished with a combination of the *fullPTable* function in the *rcompanion*
package and the *multcompLetters* function in the *multcompView* package.

Note that the *p*-value results of the pairwise Mann–Whitney U-tests
differ somewhat from those of the Dunn test.

A compact letter display condenses a table of *p*-values into a simpler
format. In the output, groups are separated by letters. Groups sharing the
same letter are not significantly different. Compact letter displays are a
clear and succinct way to present results of multiple comparisons.

The code creates a matrix of *p*-values called *PT*, then converts
this to a fuller matrix called *PT1*. *PT1* is then passed to the *multcompLetters*
function to be converted to a compact letter display.

### Order groups by median

Data$Speaker = factor(Data$Speaker,

levels=c("Pooh", "Tigger", "Piglet"))

Data

### Pairwise Mann–Whitney

PT = pairwise.wilcox.test(Data$Likert,

Data$Speaker,

p.adjust.method="fdr")

# Adjusts p-values for multiple comparisons;

# See ?p.adjust for options

PT

Pairwise comparisons using Wilcoxon rank sum test

Pooh Tigger

Tigger 0.5174 -

Piglet 0.0012 0.0012

P value adjustment method: fdr

### Note that the values in the table are p-values comparing each pair of groups.

### Convert PT to a full table and call it PT1

library(rcompanion)

PT1 = fullPTable(PT)

PT1

Pooh Tigger Piglet

Pooh 1.000000000 0.517377650 0.001241095

Tigger 0.517377650 1.000000000 0.001241095

Piglet 0.001241095 0.001241095 1.000000000

### Produce compact letter display

library(multcompView)

multcompLetters(PT1,

compare="<",

threshold=0.05, # p-value to use as significance threshold

Letters=letters,

reversed = FALSE)

Pooh Tigger Piglet

"a" "a" "b"

### Values sharing a letter are not significantly different
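To make clear what the pairwise table contains, the same adjusted *p*-values can be built up one comparison at a time. The sketch below runs *wilcox.test* on each pair of speakers and then applies *p.adjust* to the three results. The option `exact = FALSE` is an assumption added here to request the normal approximation directly, which is what R falls back to when there are ties; the object names are illustrative.

```r
### Scores re-entered from the example data above
Likert  <- c(3,5,4,4,4,4,4,4,5,5,    # Pooh
             2,4,2,2,1,2,3,2,2,3,    # Piglet
             4,4,4,4,5,3,5,4,4,3)    # Tigger

Speaker <- factor(rep(c("Pooh", "Piglet", "Tigger"), each = 10),
                  levels = c("Pooh", "Tigger", "Piglet"))

### Each column of pairs is one comparison
pairs <- combn(levels(Speaker), 2)

### Unadjusted p-value from a Mann-Whitney test on each pair
raw_p <- apply(pairs, 2, function(p) {
  wilcox.test(Likert[Speaker == p[1]],
              Likert[Speaker == p[2]],
              exact = FALSE)$p.value
})

names(raw_p) <- paste(pairs[1, ], "-", pairs[2, ])

### Adjust the whole family of p-values at once
adj_p <- p.adjust(raw_p, method = "fdr")

round(adj_p, 4)
```

Note that each pairwise test ranks only the two groups being compared, whereas the Dunn test uses the ranks from all groups together. This is one reason the two approaches give somewhat different *p*-values.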

### Plot of medians and confidence intervals

The following code uses the *groupwiseMedian* function
to produce a data frame of medians for each speaker along with the 95%
confidence intervals for each median with the percentile method.

These medians are then plotted, with their confidence intervals shown as error bars. The grouping letters from the multiple comparisons (Dunn test or pairwise Mann–Whitney U-tests) are added.

library(rcompanion)

Sum = groupwiseMedian(data=Data,

group="Speaker",

var="Likert",

conf=0.95,

R=5000,

percentile=TRUE,

bca=FALSE,

digits=3)

Sum

Speaker n Median Conf.level Percentile.lower Percentile.upper

1 Pooh 10 4 0.95 4.0 5.0

2 Piglet 10 2 0.95 2.0 3.0

3 Tigger 10 4 0.95 3.5 4.5

X = 1:3

Y = Sum$Percentile.upper+0.2

Label = c("a", "b", "a")

library(ggplot2)

ggplot(Sum, ### The data frame to use.

aes(x = Speaker,

y = Median)) +

geom_errorbar(aes(ymin = Percentile.lower,

ymax = Percentile.upper),

width = 0.05,

linewidth = 0.5) +

geom_point(shape = 15,

size = 4) +

theme_bw() +

theme(axis.title = element_text(face = "bold")) +

ylab("Median Likert score")+

annotate("text",

x = X,

y = Y,

label = Label)

Plot of median Likert versus Speaker. Error bars indicate the 95% confidence
intervals for the median with the percentile method.
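The idea behind the percentile method can be sketched in a few lines of base R: resample the scores with replacement many times, take the median of each resample, and use the 2.5th and 97.5th percentiles of those medians as the interval. This is only an illustration of the idea for one speaker, not the code used by *groupwiseMedian*, so the limits may differ slightly from those in the table above. The seed is arbitrary and is set only so the result is repeatable.

```r
set.seed(1)    # arbitrary seed, for repeatability only

### Pooh's scores from the example data above
pooh <- c(3,5,4,4,4,4,4,4,5,5)

### Median of each of 5000 bootstrap resamples
boot_medians <- replicate(5000,
                          median(sample(pooh, replace = TRUE)))

### Percentile interval: middle 95% of the bootstrap medians
ci <- quantile(boot_medians, probs = c(0.025, 0.975))

c(median = median(pooh), ci)
```

Because Likert scores take only a few distinct values, the bootstrap medians also take only a few values, and the resulting intervals tend to fall on whole or half numbers.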

### Exercises L

1. Considering Pooh, Piglet, and Tigger’s data,

What was the median score for each speaker?

Are the data for all speakers reasonably similar in
shape and spread?

Based on your previous answer, what is the null hypothesis for
the Kruskal–Wallis test?

According to the Kruskal–Wallis test, is there a statistical difference
in scores among the speakers?

Looking at the post-hoc analysis, which speakers’ scores are statistically different from which others? Who had the statistically highest scores?

2. Brian, Stewie, and Meg want to assess the education level of students in
their courses on creative writing for adults. They want to know the median
education level for each class, and if the education level of the classes were
different among instructors.

They used the following table to code their data.

Code Abbreviation Level

1 < HS Less than high school

2 HS High school

3 BA Bachelor’s

4 MA Master’s

5 PhD Doctorate
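One way to carry this coding into R is to create an ordered factor whose labels are the abbreviations in the table. The sketch below uses a few made-up codes purely for illustration; they are not the course data, and the object names are hypothetical.

```r
### Hypothetical codes for illustration only; not the exercise data
Education <- c(3, 2, 5, 1, 4)

### Ordered factor with the abbreviations from the coding table as labels
Education.f <- factor(Education,
                      levels  = 1:5,
                      labels  = c("< HS", "HS", "BA", "MA", "PhD"),
                      ordered = TRUE)

Education.f

### Median of the numeric codes, then the label for that code.
### A median such as 3.5 falls between two levels and has no single label.
med_code <- median(Education)

levels(Education.f)[med_code]
```

Keeping both the numeric variable and the labeled factor in the data frame is convenient: the numeric codes can be passed to the tests, while the factor gives readable tables and plots.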

The following are the course data.

Instructor Student Education

'Brian Griffin' a 3

'Brian Griffin' b 2

'Brian Griffin' c 3

'Brian Griffin' d 3

'Brian Griffin' e 3

'Brian Griffin' f 3

'Brian Griffin' g 4

'Brian Griffin' h 5

'Brian Griffin' i 3

'Brian Griffin' j 4

'Brian Griffin' k 3

'Brian Griffin' l 2

'Stewie Griffin' m 4

'Stewie Griffin' n 5

'Stewie Griffin' o 4

'Stewie Griffin' p 4

'Stewie Griffin' q 4

'Stewie Griffin' r 4

'Stewie Griffin' s 3

'Stewie Griffin' t 5

'Stewie Griffin' u 4

'Stewie Griffin' v 4

'Stewie Griffin' w 3

'Stewie Griffin' x 2

'Meg Griffin' y 3

'Meg Griffin' z 4

'Meg Griffin' aa 3

'Meg Griffin' ab 3

'Meg Griffin' ac 3

'Meg Griffin' ad 2

'Meg Griffin' ae 3

'Meg Griffin' af 4

'Meg Griffin' ag 2

'Meg Griffin' ah 3

'Meg Griffin' ai 2

'Meg Griffin' aj 1

For each of the following, answer the question, and **show
the output from the analyses you used to answer the question**.

What was the median education level for each instructor’s class? (Be sure to report the education level, not just the numeric code!)

Are the distributions of education levels for all instructors
reasonably similar in shape and spread?

Based on your previous answer, what is the null hypothesis for
the Kruskal–Wallis test?

According to the Kruskal–Wallis test, is there a difference in the education level of students among the instructors?

Looking at the post-hoc analysis, which classes’ education levels are different from which others? Who had the statistically highest education level?

Plot Brian, Stewie, and Meg’s data in a way that helps you visualize the data. Do the results reflect what you would expect from looking at the plot? How would you summarize the results of the descriptive statistics and tests?