The Kruskal–Wallis test is a rank-based test that is similar to the Mann–Whitney U test, but can be applied to one-way data with more than two groups.
Without further assumptions about the distribution of the data, the Kruskal–Wallis test does not address hypotheses about the medians of the groups. Instead, the test addresses if it is likely that an observation in one group is greater than an observation in the other. This is sometimes stated as testing if one sample has stochastic dominance compared with the other.
The test assumes that the observations are independent. That is, it is not appropriate for paired observations or repeated measures data.
It is performed with the kruskal.test function.
Appropriate effect size statistics include maximum Vargha and Delaney’s A, maximum Cliff’s delta, Freeman’s theta, and epsilon-squared.
Post-hoc tests
The outcome of the Kruskal–Wallis test tells you if there are differences among the groups, but doesn’t tell you which groups are different from other groups. In order to determine which groups are different from others, post-hoc testing can be conducted. Probably the most common post-hoc test for the Kruskal–Wallis test is the Dunn test, here conducted with the dunnTest function in the FSA package.
Appropriate data
• One-way data
• Dependent variable is ordinal, interval, or ratio
• Independent variable is a factor with two or more levels. That is, two or more groups
• Observations between groups are independent. That is, not paired or repeated measures data
• In order to be a test of medians, the distributions of values for each group need to be of similar shape and spread. Otherwise the test is typically a test of stochastic equality.
Hypotheses
• Null hypothesis: The groups are sampled from populations with identical distributions. Typically, that the sampled populations exhibit stochastic equality.
• Alternative hypothesis (two-sided): The groups are sampled from populations with different distributions. Typically, that one sampled population exhibits stochastic dominance.
Interpretation
Significant results can be reported as “There was a significant difference in values among groups.”
Post-hoc analysis allows you to say “There was a significant difference in values between groups A and B.”, and so on.
Other notes and alternative tests
Mood’s median test compares the medians of groups.
Packages used in this chapter
The packages used in this chapter include:
• psych
• FSA
• lattice
• multcompView
• rcompanion
The following commands will install these packages if they are not already installed:
if(!require(psych)){install.packages("psych")}
if(!require(FSA)){install.packages("FSA")}
if(!require(lattice)){install.packages("lattice")}
if(!require(multcompView)){install.packages("multcompView")}
if(!require(rcompanion)){install.packages("rcompanion")}
Kruskal–Wallis test example
This example re-visits the Pooh, Piglet, and Tigger data from the Descriptive Statistics with the likert Package chapter.
It answers the question, “Are the scores significantly different among the three speakers?”
The Kruskal–Wallis test is conducted with the kruskal.test function, which produces a p-value for the hypothesis. First the data are summarized and examined using bar plots for each group.
Input =("
Speaker Likert
Pooh 3
Pooh 5
Pooh 4
Pooh 4
Pooh 4
Pooh 4
Pooh 4
Pooh 4
Pooh 5
Pooh 5
Piglet 2
Piglet 4
Piglet 2
Piglet 2
Piglet 1
Piglet 2
Piglet 3
Piglet 2
Piglet 2
Piglet 3
Tigger 4
Tigger 4
Tigger 4
Tigger 4
Tigger 5
Tigger 3
Tigger 5
Tigger 4
Tigger 4
Tigger 3
")
Data = read.table(textConnection(Input),header=TRUE)
### Order levels of the factor; otherwise R will alphabetize them
Data$Speaker = factor(Data$Speaker,
levels=unique(Data$Speaker))
### Create a new variable which is the likert scores as an ordered factor
Data$Likert.f = factor(Data$Likert,
ordered = TRUE)
### Check the data frame
library(psych)
headTail(Data)
str(Data)
summary(Data)
### Remove unnecessary objects
rm(Input)
Summarize data treating Likert scores as factors
xtabs( ~ Speaker + Likert.f,
data = Data)
Likert.f
Speaker 1 2 3 4 5
Pooh 0 0 1 6 3
Piglet 1 6 2 1 0
Tigger 0 0 2 6 2
XT = xtabs( ~ Speaker + Likert.f,
data = Data)
prop.table(XT,
margin = 1)
Likert.f
Speaker 1 2 3 4 5
Pooh 0.0 0.0 0.1 0.6 0.3
Piglet 0.1 0.6 0.2 0.1 0.0
Tigger 0.0 0.0 0.2 0.6 0.2
Bar plots of data by group
library(lattice)
histogram(~ Likert.f | Speaker,
data=Data,
layout=c(1,3) # columns and rows of
individual plots
)

Summarize data treating Likert scores as numeric
library(FSA)
Summarize(Likert ~ Speaker,
data=Data,
digits=3)
Speaker n mean sd min Q1 median Q3 max percZero
1 Pooh 10 4.2 0.632 3 4 4 4.75 5 0
2 Piglet 10 2.3 0.823 1 2 2 2.75 4 0
3 Tigger 10 4.0 0.667 3 4 4 4.00 5 0
Kruskal–Wallis test example
This example uses the formula notation indicating that Likert is the dependent variable and Speaker is the independent variable. The data= option indicates the data frame that contains the variables. For the meaning of other options, see ?kruskal.test.
kruskal.test(Likert ~ Speaker,
data = Data)
Kruskal-Wallis rank sum test
Kruskal-Wallis chi-squared = 16.842, df = 2, p-value = 0.0002202
Effect size
Statistics of effect size for the Kruskal–Wallis test provide the degree to which one group has data with higher ranks than another group. They are related to the probability that a value from one group will be greater than a value from another group. Unlike p-values, they are not affected by sample size.
Appropriate effect size statistics for the Kruskal–Wallis test include Freeman’s theta and epsilon-squared. epsilon-squared is probably the most common. For Freeman’s theta, an effect size of 1 indicates that the measurements for each group are entirely greater or entirely less than some other group, and an effect size of 0 indicates that there is no effect; that is, that the groups are absolutely stochastically equal.
Another option is to use the maximum Cliff’s delta or Vargha and Delaney’s A (VDA) from pairwise comparisons of all groups. VDA is the probability that an observation from one group is greater than an observation from the other group. Because of this interpretation, VDA is an effect size statistic that is relatively easy to understand.
Interpretation of effect sizes necessarily varies by discipline and the expectations of the experiment. The following guidelines are based on my personal intuition or published values. They should not be considered universal.
Technical note: The values for the interpretations for Freeman’s theta to epsilon-squared below were derived by keeping the interpretation for epsilon-squared constant and equal to that for the Mann–Whitney test. Interpretation values for Freeman’s theta were determined through comparing Freeman’s theta to epsilon-squared for simulated data (5-point Likert items, n per group between 4 and 25).
Interpretations for Vargha and Delaney’s A and Cliff’s delta come from Vargha and Delaney (2000).
|
|
small
|
medium |
large |
|
epsilon-squared |
0.01 – < 0.08 |
0.08 – < 0.26 |
≥ 0.26 |
|
Freeman’s theta, k = 2 |
0.11 – < 0.34 |
0.34 – < 0.58 |
≥ 0.58 |
|
Freeman’s theta, k = 3 |
0.05 – < 0.26 |
0.26 – < 0.46 |
≥ 0.46 |
|
Freeman’s theta, k = 5 |
0.05 – < 0.21 |
0.21 – < 0.40 |
≥ 0.40 |
|
Freeman’s theta, k = 7 |
0.05 – < 0.20 |
0.20 – < 0.38 |
≥ 0.38 |
|
Freeman’s theta, k = 7 |
0.05 – < 0.20 |
0.20 – < 0.38 |
≥ 0.38 |
|
Maximum Cliff’s delta |
0.11 – < 0.28 |
0.28 – < 0.43 |
≥ 0.43 |
|
Maximum Vargha and Delaney’s A |
0.56 – < 0.64 > 0.34 – 0.44 |
0.64 – < 0.71 > 0.29 – 0.34 |
≥ 0.71 ≤ 0.29 |
epsilon-squared
library(rcompanion)
epsilonSquared(x = Data$Likert,
g = Data$Speaker)
epsilon.squared
0.581
Freeman’s theta
library(rcompanion)
freemanTheta(x = Data$Likert,
g = Data$Speaker)
Freeman.theta
0.64
Maximum Vargha and Delaney’s A or Cliff’s delta
Here, the multiVDA function is used to calculate Vargha and Delaney’s A (VDA), Cliff’s delta (CD), and r between all pairs of groups. The function identifies the comparison with the most extreme VDA statistic (0.95 for Pooh – Piglet). That is, it identifies the most disparate groups.
source("http://rcompanion.org/r_script/multiVDA.r")
library(rcompanion)
library(coin)
multiVDA(x = Data$Likert,
g = Data$Speaker)
$pairs
Comparison VDA CD r VDA.m CD.m r.m
1 Pooh - Piglet = 0 0.95 0.90 0.791 0.95 0.90 0.791
2 Pooh - Tigger = 0 0.58 0.16 0.154 0.58 0.16 0.154
3 Piglet - Tigger = 0 0.07 -0.86 -0.756 0.93 0.86 0.756
$comparison
Comparison
"Pooh - Piglet = 0"
$statistic
VDA
0.95
$statistic.m
VDA.m
0.95
Post-hoc test: Dunn test for multiple comparisons of groups
If the Kruskal–Wallis test is significant, a post-hoc analysis can be performed to determine which groups differ from each other group.
Probably the most popular post-hoc test for the Kruskal–Wallis test is the Dunn test. The Dunn test can be conducted with the dunnTest function in the FSA package.
Because the post-hoc test will produce multiple p-values, adjustments to the p-values can be made to avoid inflating the possibility of making a type-I error. There are a variety of methods for controlling the familywise error rate or for controlling the false discovery rate. See ?p.adjust for details on these methods.
When there are many p-values to evaluate, it is useful to condense a table of p-values to a compact letter display format. In the output, groups are separated by letters. Groups sharing the same letter are not significantly different. Compact letter displays are a clear and succinct way to present results of multiple comparisons.
### Order groups by median
Data$Speaker = factor(Data$Speaker,
levels=c("Pooh", "Tigger",
"Piglet"))
levels(Data$Speaker)
### Dunn test
library(FSA)
DT = dunnTest(Likert ~ Speaker,
data=Data,
method="bh") # Adjusts
p-values for multiple comparisons;
# See ?dunnTest
for options
DT
Dunn (1964) Kruskal-Wallis multiple comparison
p-values adjusted with the Benjamini-Hochberg method.
Comparison Z P.unadj P.adj
1 Pooh - Tigger 0.4813074 0.6302980448 0.6302980448
2 Pooh - Piglet 3.7702412 0.0001630898 0.0004892695
3 Tigger - Piglet 3.2889338 0.0010056766 0.0015085149
### Compact letter display
PT = DT$res
PT
library(rcompanion)
cldList(P.adj ~ Comparison,
data = PT,
threshold = 0.05)
Group Letter MonoLetter
1 Pooh a a
2 Tigger a a
3 Piglet b b
Groups sharing a letter not signficantly different
(alpha = 0.05).
Post-hoc test: pairwise Mann–Whitney U-tests for multiple comparisons
I don’t recommend using pairwise Mann–Whitney U-tests for post-hoc
testing for the Kruskal–Wallis test, but the following example shows how this
can be done.
The pairwise.wilcox.test function produces a table of p-values comparing
each pair of groups.
To prevent the inflation of type I error rates, adjustments to the p-values
can be made using the p.adjust.method option. Here the fdr
method is used. See ?p.adjust for details on available p-value
adjustment methods.
When there are many p-values to evaluate, it is useful to condense a
table of p-values to a compact letter display format. This can be
accomplished with a combination of the fullPTable function in the rcompanion
package and the multcompLetters function in the multcompView
package.
In a compact letter display, groups sharing the same letter are not
significantly different.
Here the fdr p-value adjustment method is used. See ?p.adjust
for details on available methods.
The code creates a matrix of p-values called PT, then converts
this to a fuller matrix called PT1. PT1 is then passed to the multcompLetters
function to be converted to a compact letter display.
Note that the p-value results of the pairwise Mann–Whitney U-tests differ somewhat from those of the Dunn test.
### Order groups by median
Data$Speaker = factor(Data$Speaker,
levels=c("Pooh", "Tigger", "Piglet"))
Data
### Pairwise Mann–Whitney
PT = pairwise.wilcox.test(Data$Likert,
Data$Speaker,
p.adjust.method="fdr")
# Adjusts p-values for
multiple comparisons;
# See ?p.adjust for
options
PT
Pairwise comparisons using Wilcoxon rank sum test
Pooh Tigger
Tigger 0.5174 -
Piglet 0.0012 0.0012
P value adjustment method: fdr
### Note that the values in the table are p-values
comparing each
### pair of groups.
### Convert PT to a full table and call
it PT1
PT = PT$p.value ### Extract p-value table
library(rcompanion)
PT1 = fullPTable(PT)
PT1
Pooh Tigger Piglet
Pooh 1.000000000 0.517377650 0.001241095
Tigger 0.517377650 1.000000000 0.001241095
Piglet 0.001241095 0.001241095 1.000000000
### Produce compact letter display
library(multcompView)
multcompLetters(PT1,
compare="<",
threshold=0.05, # p-value to use
as significance threshold
Letters=letters,
reversed = FALSE)
Pooh Tigger Piglet
"a" "a" "b"
### Values sharing a letter are not significantly
different
Plot of medians and confidence intervals
The following code uses the groupwiseMedian function to produce a data frame of medians for each speaker along with the 95% confidence intervals for each median with the percentile method. These medians are then plotted, with their confidence intervals shown as error bars. The grouping letters from the multiple comparisons (Dunn test or pairwise Mann–Whitney U-tests) are added.
Note that bootstrapped confidence intervals may not be reliable for discreet data, such as the ordinal Likert data used in these examples, especially for small samples.
library(rcompanion)
Sum = groupwiseMedian(Likert ~ Speaker,
data = Data,
conf = 0.95,
R = 5000,
percentile = TRUE,
bca = FALSE,
digits = 3)
Sum
Speaker n Median Conf.level Percentile.lower Percentile.upper
1 Pooh 10 4 0.95 4.0 5.0
2 Piglet 10 2 0.95 2.0 3.0
3 Tigger 10 4 0.95 3.5 4.5
X = 1:3
Y = Sum$Percentile.upper + 0.2
Label = c("a", "b", "a")
library(ggplot2)
ggplot(Sum, ### The data frame to
use.
aes(x = Speaker,
y = Median)) +
geom_errorbar(aes(ymin = Percentile.lower,
ymax = Percentile.upper),
width = 0.05,
size = 0.5) +
geom_point(shape = 15,
size = 4) +
theme_bw() +
theme(axis.title = element_text(face = "bold")) +
ylab("Median Likert score") +
annotate("text",
x = X,
y = Y,
label = Label)

Plot of median Likert score versus Speaker. Error bars indicate the 95% confidence
intervals for the median with the percentile method.
References
Cohen, J. 1988. Statistical Power Analysis for the Behavioral Sciences, 2nd Edition. Routledge.
Vargha, A. and H.D. Delaney. A Critique and Improvement of the CL Common Language Effect Size Statistics of McGraw and Wong. 2000. Journal of Educational and Behavioral Statistics 25(2):101–132.
Exercises L
1. Considering Pooh, Piglet, and Tigger’s data,
a. What was the median score for each instructor?
b. According to the Kruskal–Wallis test, is there a statistical
difference in scores among the instructors?
c. What is the value of maximum Vargha and Delaney’s A for
these data?
d. How do you interpret this value? (What does it mean? And is
the standard interpretation in terms of “small”, “medium”, or “large”?)
e. Looking at the post-hoc analysis, which speakers’ scores
are statistically different from which others? Who had the statistically
highest scores?
f. How would you summarize the results of the descriptive statistics and tests? Include practical considerations of any differences.
2. Brian, Stewie, and Meg want to assess the education level of students in
their courses on creative writing for adults. They want to know the median
education level for each class, and if the education level of the classes were
different among instructors.
They used the following table to code his data.
Code Abbreviation Level
1 < HS Less than high school
2 HS High school
3 BA Bachelor’s
4 MA Master’s
5 PhD Doctorate
The following are the course data.
Instructor Student Education
'Brian Griffin' a 3
'Brian Griffin' b 2
'Brian Griffin' c 3
'Brian Griffin' d 3
'Brian Griffin' e 3
'Brian Griffin' f 3
'Brian Griffin' g 4
'Brian Griffin' h 5
'Brian Griffin' i 3
'Brian Griffin' j 4
'Brian Griffin' k 3
'Brian Griffin' l 2
'Stewie Griffin' m 4
'Stewie Griffin' n 5
'Stewie Griffin' o 4
'Stewie Griffin' p 4
'Stewie Griffin' q 4
'Stewie Griffin' r 4
'Stewie Griffin' s 3
'Stewie Griffin' t 5
'Stewie Griffin' u 4
'Stewie Griffin' v 4
'Stewie Griffin' w 3
'Stewie Griffin' x 2
'Meg Griffin' y 3
'Meg Griffin' z 4
'Meg Griffin' aa 3
'Meg Griffin' ab 3
'Meg Griffin' ac 3
'Meg Griffin' ad 2
'Meg Griffin' ae 3
'Meg Griffin' af 4
'Meg Griffin' ag 2
'Meg Griffin' ah 3
'Meg Griffin' ai 2
'Meg Griffin' aj 1
For each of the following, answer the question, and show the output from the analyses you used to answer the question.
a. What was the median education level for each instructor’s class? (Be sure to report the education level, not just the numeric code!)
b. According to the Kruskal–Wallis test, is there a difference
in the education level of students among the instructors?
c. What is the value of maximum Vargha and Delaney’s A for
these data?
d. How do you interpret this value? (What does it mean? And is
the standard interpretation in terms of “small”, “medium”, or “large”?)
e. Looking at the post-hoc analysis, which classes education levels are statistically different from which others? Who had the statistically highest education level?
f. Plot Brian, Stewie, and Meg’s data in a way that helps you
visualize the data. Do the results reflect what you would expect from looking
at the plot?
g. How would you summarize the results of the descriptive statistics and tests? What do you conclude practically?