The Kruskal–Wallis test is a rank-based test that is similar to the Mann–Whitney U test, but can be applied to one-way data with more than two groups. The test is useful for comparing the scores or ratings from multiple speakers, presentations, or groups of audiences.
If the shape and spread of the distributions of values for each group are similar, then the test compares the medians of the groups. Otherwise, the test is really testing whether there is a systematic difference in the values among the groups.
The test is performed with the kruskal.test function.
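To make the rank-based nature of the test concrete, the following is a minimal sketch that computes the Kruskal–Wallis statistic by hand and compares it with the result from kruskal.test. The scores and group names are hypothetical toy data, not data from this chapter, and the object names are chosen only for illustration.

```r
### Hypothetical toy data: three groups with four scores each
score = c(3, 5, 4, 4,   2, 1, 2, 3,   4, 5, 5, 4)
group = factor(rep(c("A", "B", "C"), each = 4))

### Rank all observations together; tied values get the average rank
r = rank(score)
n = length(score)

### Sum of ranks and number of observations in each group
Ri = tapply(r, group, sum)
ni = tapply(r, group, length)

### H statistic before the correction for ties
H = 12 / (n * (n + 1)) * sum(Ri^2 / ni) - 3 * (n + 1)

### Correction for ties
ties  = table(score)
H.adj = H / (1 - sum(ties^3 - ties) / (n^3 - n))

### p-value from the chi-square distribution with k - 1 degrees of freedom
k = nlevels(group)
p = pchisq(H.adj, df = k - 1, lower.tail = FALSE)

### Result from the built-in function, for comparison
KW = kruskal.test(score ~ group)

c(by.hand = H.adj, built.in = unname(KW$statistic))
c(by.hand = p,     built.in = KW$p.value)
```

The hand-calculated statistic and p-value should agree with those reported by kruskal.test, which shows that only the ranks of the values, and not the values themselves, enter the test.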
Post-hoc tests
The outcome of the Kruskal–Wallis test tells you if there are differences among the groups, but doesn’t tell you which groups are different from other groups. To determine which groups are different from others, post-hoc testing can be conducted. Probably the most common post-hoc test for the Kruskal–Wallis test is the Dunn test, here conducted with the dunnTest function in the FSA package. An alternative to this is to conduct Mann–Whitney tests on each pair of groups. This is accomplished with the pairwise.wilcox.test function.
Appropriate data
• One-way data
• Dependent variable is ordinal, interval, or ratio
• Independent variable is a factor with two or more levels. That is, two or more groups
• Observations between groups are independent. That is, not paired or repeated measures data
• In order to be a test of medians, the distributions of values for each group need to be of similar shape and spread. Otherwise the test is a test of stochastic equality.
Hypotheses
If the distributions of the groups are similar in shape and spread:
• Null hypothesis: The medians of values for each group are equal.
• Alternative hypothesis (two-sided): The medians of values for each group are not equal.
If the distributions of the groups are not similar in shape and spread:
• Null hypothesis: The groups exhibit stochastic equality.
• Alternative hypothesis (two-sided): The groups do not exhibit stochastic equality.
Interpretation
If the distributions of the groups are similar in shape and spread:
Significant results can be reported as “There was a significant difference in median values across groups.”
Post-hoc analysis allows you to say “The median for group A was higher than the median for group B”, and so on.
If the distributions of the groups are not similar in shape and spread:
Significant results can be reported as “There was a significant difference in values among groups.”
Other notes and alternative tests
Mood’s median test compares the medians of groups.
Packages used in this chapter
The packages used in this chapter include:
• psych
• FSA
• lattice
• multcompView
• rcompanion
The following commands will install these packages if they are not already installed:
if(!require(psych)){install.packages("psych")}
if(!require(FSA)){install.packages("FSA")}
if(!require(lattice)){install.packages("lattice")}
if(!require(multcompView)){install.packages("multcompView")}
if(!require(rcompanion)){install.packages("rcompanion")}
Kruskal–Wallis test example
This example revisits the Pooh, Piglet, and Tigger data from the Descriptive Statistics with the likert Package chapter.
It answers the question, “Are the scores significantly different among the three speakers?”
The Kruskal–Wallis test is conducted with the kruskal.test function, which produces a p-value for the hypothesis. First the data are summarized and examined using bar plots for each group.
Note that because the bar plots show that the distributions of scores for each of the speakers are relatively similar in shape and spread, the Kruskal–Wallis test can be interpreted as a test of medians.
Input =("
Speaker Likert
Pooh 3
Pooh 5
Pooh 4
Pooh 4
Pooh 4
Pooh 4
Pooh 4
Pooh 4
Pooh 5
Pooh 5
Piglet 2
Piglet 4
Piglet 2
Piglet 2
Piglet 1
Piglet 2
Piglet 3
Piglet 2
Piglet 2
Piglet 3
Tigger 4
Tigger 4
Tigger 4
Tigger 4
Tigger 5
Tigger 3
Tigger 5
Tigger 4
Tigger 4
Tigger 3
")
Data = read.table(textConnection(Input),header=TRUE)
### Order levels of the factor; otherwise R will alphabetize them
Data$Speaker = factor(Data$Speaker,
levels=unique(Data$Speaker))
### Create a new variable which is the likert scores as an ordered factor
Data$Likert.f = factor(Data$Likert,
ordered = TRUE)
### Check the data frame
library(psych)
headTail(Data)
str(Data)
summary(Data)
### Remove unnecessary objects
rm(Input)
Summarize data treating Likert scores as factors
xtabs( ~ Speaker + Likert.f,
data = Data)
Likert.f
Speaker 1 2 3 4 5
Pooh 0 0 1 6 3
Piglet 1 6 2 1 0
Tigger 0 0 2 6 2
XT = xtabs( ~ Speaker + Likert.f,
data = Data)
prop.table(XT,
margin = 1)
Likert.f
Speaker 1 2 3 4 5
Pooh 0.0 0.0 0.1 0.6 0.3
Piglet 0.1 0.6 0.2 0.1 0.0
Tigger 0.0 0.0 0.2 0.6 0.2
Bar plots of data by group
library(lattice)
histogram(~ Likert.f | Speaker,
data=Data,
layout=c(1,3) # columns and rows of individual plots
)
Summarize data treating Likert scores as numeric
library(FSA)
Summarize(Likert ~ Speaker,
data=Data,
digits=3)
Speaker n mean sd min Q1 median Q3 max percZero
1 Pooh 10 4.2 0.632 3 4 4 4.75 5 0
2 Piglet 10 2.3 0.823 1 2 2 2.75 4 0
3 Tigger 10 4.0 0.667 3 4 4 4.00 5 0
Kruskal–Wallis test example
This example uses the formula notation indicating that Likert is the dependent variable and Speaker is the independent variable. The data= option indicates the data frame that contains the variables. For the meaning of other options, see ?kruskal.test.
kruskal.test(Likert ~ Speaker,
data = Data)
Kruskal-Wallis rank sum test
Kruskal-Wallis chi-squared = 16.842, df = 2, p-value = 0.0002202
Effect size
Statistics of effect size for the Kruskal–Wallis test provide the degree to which one group has data with higher ranks than another group. They are related to the probability that a value from one group will be greater than a value from another group. Unlike p-values, they are not affected by sample size. They are standardized to range from 0 to 1. An effect size of 0 indicates that there is no effect; that is, that the groups are absolutely stochastically equal. For Freeman’s theta, an effect size of 1 indicates that the measurements for each group are entirely greater or entirely less than some other group.
Appropriate effect size statistics for the Kruskal–Wallis test include Freeman’s theta and epsilon-squared. Epsilon-squared is probably the most common.
Another option is to use the maximum Cliff’s delta or Vargha and Delaney’s A from pairwise comparisons of all groups.
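As a rough illustration of how these pairwise statistics relate to the probability of one group having higher values than another, the following sketch computes Vargha and Delaney’s A and Cliff’s delta by hand for a single pair of groups, using the Pooh and Piglet scores from this chapter. The object names are chosen only for illustration, and this is not presented as the calculation used by any particular package.

```r
### Scores for two of the speakers, copied from the example data
Pooh   = c(3, 5, 4, 4, 4, 4, 4, 4, 5, 5)
Piglet = c(2, 4, 2, 2, 1, 2, 3, 2, 2, 3)

### Compare every Pooh score with every Piglet score
Diff = outer(Pooh, Piglet, "-")

P.greater = mean(Diff > 0)    ### Proportion of pairs where Pooh is higher
P.less    = mean(Diff < 0)    ### Proportion of pairs where Pooh is lower
P.equal   = mean(Diff == 0)   ### Proportion of tied pairs

### Vargha and Delaney's A: probability of a higher value,
###   with ties counted as one half
A = P.greater + 0.5 * P.equal

### Cliff's delta: difference between the two probabilities
delta = P.greater - P.less

A
delta
```

For this pair, A works out to 0.95 and Cliff’s delta to 0.90, and the two are related by delta = 2A – 1.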
Interpretation of effect sizes necessarily varies by discipline and the expectations of the experiment. The following guidelines are based on my personal intuition or published values. They should not be considered universal.
Technical note: The interpretation values for Freeman’s theta below were derived by keeping the interpretation for epsilon-squared constant and equal to that for the Mann–Whitney test. Interpretation values for Freeman’s theta were then determined by comparing Freeman’s theta to epsilon-squared for simulated data (5-point Likert items, n per group between 4 and 25).
Interpretation for Vargha and Delaney’s A comes from Vargha and Delaney (2000). The values for Cliff’s delta are commonly cited on the internet, but I don’t know their origin.

                                 small                            medium                           large
epsilon-squared                  0.01 – < 0.08                    0.08 – < 0.26                    ≥ 0.26
Freeman’s theta, k = 2           0.11 – < 0.34                    0.34 – < 0.58                    ≥ 0.58
Freeman’s theta, k = 3           0.05 – < 0.26                    0.26 – < 0.46                    ≥ 0.46
Freeman’s theta, k = 5           0.05 – < 0.21                    0.21 – < 0.40                    ≥ 0.40
Freeman’s theta, k = 7           0.05 – < 0.20                    0.20 – < 0.38                    ≥ 0.38
Maximum Cliff’s delta            0.147 – < 0.330                  0.330 – < 0.474                  ≥ 0.474
Maximum Vargha and Delaney’s A   0.56 – < 0.64 or > 0.34 – 0.44   0.64 – < 0.71 or > 0.29 – 0.34   ≥ 0.71 or ≤ 0.29
Epsilon-squared
library(rcompanion)
epsilonSquared(x = Data$Likert,
g = Data$Speaker)
epsilon.squared
0.581
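One common definition of epsilon-squared divides the Kruskal–Wallis statistic H by n – 1, where n is the total number of observations. The sketch below applies that definition in base R to the example data. It is an assumption here that this matches the calculation inside epsilonSquared, although it does reproduce the value reported above for these data.

```r
### Example data from this chapter, entered as plain vectors
Likert  = c(3, 5, 4, 4, 4, 4, 4, 4, 5, 5,     ### Pooh
            2, 4, 2, 2, 1, 2, 3, 2, 2, 3,     ### Piglet
            4, 4, 4, 4, 5, 3, 5, 4, 4, 3)     ### Tigger
Speaker = factor(rep(c("Pooh", "Piglet", "Tigger"), each = 10))

### Kruskal-Wallis statistic, H
KW = kruskal.test(Likert ~ Speaker)
H  = unname(KW$statistic)

### Epsilon-squared as H divided by n - 1
n       = length(Likert)
Epsilon = H / (n - 1)

round(Epsilon, 3)
```

Because H is at most n – 1 only approximately when there are many ties, this hand calculation is best treated as a check on the packaged function rather than a replacement for it.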
Freeman’s theta
library(rcompanion)
freemanTheta(x = Data$Likert,
g = Data$Speaker)
Freeman.theta
0.64
Maximum Cliff’s delta
source("http://rcompanion.org/r_script/multiVDA.r")
library(rcompanion)
library(coin)
multiVDA(x = Data$Likert,
g = Data$Speaker,
statistic = "CDA")
$statistic
CDA
0.9
Maximum Vargha and Delaney’s A
source("http://rcompanion.org/r_script/multiVDA.r")
library(rcompanion)
library(coin)
multiVDA(x = Data$Likert,
g = Data$Speaker,
statistic = "VDAH")
$statistic
VDAH
0.95
Post-hoc test: Dunn test for multiple comparisons of groups
If the Kruskal–Wallis test is significant, a post-hoc analysis can be performed to determine which groups differ from each other group.
Probably the most popular post-hoc test for the Kruskal–Wallis test is the Dunn test. The Dunn test can be conducted with the dunnTest function in the FSA package.
Because the post-hoc test will produce multiple p-values, adjustments to the p-values can be made to avoid inflating the possibility of making a type I error. There are a variety of methods for controlling the familywise error rate or for controlling the false discovery rate. See ?p.adjust for details on these methods.
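To see what such an adjustment does, the following sketch applies three of the methods in p.adjust to the unadjusted p-values that the Dunn test reports for these data in the output later in this section. The object names are illustrative only.

```r
### Unadjusted p-values from the Dunn test output in this section
P.unadj = c("Pooh - Tigger"   = 0.6302980448,
            "Pooh - Piglet"   = 0.0001630898,
            "Tigger - Piglet" = 0.0010056766)

### Benjamini-Hochberg: controls the false discovery rate
P.bh = p.adjust(P.unadj, method = "BH")

### Holm and Bonferroni: control the familywise error rate
P.holm = p.adjust(P.unadj, method = "holm")
P.bonf = p.adjust(P.unadj, method = "bonferroni")

round(cbind(P.unadj, P.bh, P.holm, P.bonf), 5)
```

The Benjamini–Hochberg column should match the P.adj column of the dunnTest output, while the familywise methods give adjusted p-values that are at least as large.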
When there are many p-values to evaluate, it is useful to condense a table of p-values to a compact letter display format. In the output, groups are separated by letters. Groups sharing the same letter are not significantly different. Compact letter displays are a clear and succinct way to present results of multiple comparisons.
### Order groups by median
Data$Speaker = factor(Data$Speaker,
levels=c("Pooh", "Tigger",
"Piglet"))
levels(Data$Speaker)
### Dunn test
library(FSA)
DT = dunnTest(Likert ~ Speaker,
data=Data,
method="bh") # Adjusts p-values for multiple comparisons;
# See ?dunnTest for options
DT
Dunn (1964) Kruskal-Wallis multiple comparison
p-values adjusted with the Benjamini-Hochberg method.
Comparison Z P.unadj P.adj
1 Pooh - Tigger 0.4813074 0.6302980448 0.6302980448
2 Pooh - Piglet 3.7702412 0.0001630898 0.0004892695
3 Tigger - Piglet 3.2889338 0.0010056766 0.0015085149
### Compact letter display
PT = DT$res
PT
library(rcompanion)
cldList(P.adj ~ Comparison,
data = PT,
threshold = 0.05)
Group Letter MonoLetter
1 Pooh a a
2 Tigger a a
3 Piglet b b
Groups sharing a letter are not significantly different
(alpha = 0.05).
Post-hoc test: pairwise Mann–Whitney U-tests for multiple comparisons
Another approach to post-hoc testing for the Kruskal–Wallis test is to use Mann–Whitney U-tests for each pair of groups. This can be conducted with the pairwise.wilcox.test function. This produces a table of p-values comparing each pair of groups.
To prevent the inflation of type I error rates, adjustments to the p-values can be made using the p.adjust.method option. Here the fdr method is used. See ?p.adjust for details on available p-value adjustment methods.
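The following sketch illustrates what the pairwise approach amounts to: a separate Mann–Whitney test for each pair of groups, followed by an adjustment of the resulting p-values. It uses the example data entered as plain vectors. The option exact = FALSE is a choice made here so that the normal approximation is requested explicitly, since exact p-values cannot be computed for data with ties; the object names are illustrative.

```r
### Example data from this chapter, with groups ordered by median
Likert  = c(3, 5, 4, 4, 4, 4, 4, 4, 5, 5,     ### Pooh
            2, 4, 2, 2, 1, 2, 3, 2, 2, 3,     ### Piglet
            4, 4, 4, 4, 5, 3, 5, 4, 4, 3)     ### Tigger
Speaker = factor(rep(c("Pooh", "Piglet", "Tigger"), each = 10),
                 levels = c("Pooh", "Tigger", "Piglet"))

### All pairs of groups, one pair per column
Pairs = combn(levels(Speaker), 2)

### Mann-Whitney test for each pair, keeping only the p-value
P.raw = apply(Pairs, 2, function(pair) {
           wilcox.test(Likert[Speaker == pair[1]],
                       Likert[Speaker == pair[2]],
                       exact = FALSE)$p.value
           })
names(P.raw) = apply(Pairs, 2, paste, collapse = " - ")

### Adjust the set of p-values with the fdr method
P.fdr = p.adjust(P.raw, method = "fdr")
P.fdr

### The same comparisons from the built-in function
PW = pairwise.wilcox.test(Likert, Speaker,
                          p.adjust.method = "fdr",
                          exact = FALSE)
PW$p.value
```

The adjusted p-values from the loop should agree with the entries in the table returned by pairwise.wilcox.test.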
When there are many p-values to evaluate, it is useful to condense a table of p-values to a compact letter display format. This can be accomplished with a combination of the fullPTable function in the rcompanion package and the multcompLetters function in the multcompView package.
In a compact letter display, groups sharing the same letter are not
significantly different.
Here the fdr p-value adjustment method is used. See ?p.adjust for details on available methods.
The code creates a matrix of p-values called PT, then converts this to a fuller matrix called PT1. PT1 is then passed to the multcompLetters function to be converted to a compact letter display.
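As an illustration of what a fuller matrix means here, the sketch below builds a symmetric table by hand from the lower-triangular p-values reported for these data. It is a base R approximation of the idea, not the code of fullPTable, and the object names are hypothetical.

```r
### Lower-triangular p-value table, as returned in the p.value element
###   of a pairwise test (values taken from the output in this section)
PT = matrix(c(0.517377650, 0.001241095,
              NA,          0.001241095),
            nrow = 2,
            dimnames = list(c("Tigger", "Piglet"),
                            c("Pooh", "Tigger")))

### All group names: the first column name plus the row names
Groups = c(colnames(PT)[1], rownames(PT))
k      = length(Groups)

### Start with a square matrix of 1s, so each group compared
###   with itself has a p-value of 1
Full = matrix(1, nrow = k, ncol = k,
              dimnames = list(Groups, Groups))

### Fill the lower triangle, then mirror it into the upper triangle
Full[lower.tri(Full)] = PT[lower.tri(PT, diag = TRUE)]
Full[upper.tri(Full)] = t(Full)[upper.tri(Full)]

Full
```

The result has one row and one column per group, which is the square, symmetric layout that multcompLetters expects.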
Note that the p-value results of the pairwise Mann–Whitney U-tests differ somewhat from those of the Dunn test.
### Order groups by median
Data$Speaker = factor(Data$Speaker,
levels=c("Pooh", "Tigger",
"Piglet"))
Data
### Pairwise Mann–Whitney
PT = pairwise.wilcox.test(Data$Likert,
Data$Speaker,
p.adjust.method="fdr")
# Adjusts p-values for multiple comparisons;
# See ?p.adjust for options
PT
Pairwise comparisons using Wilcoxon rank sum test
Pooh Tigger
Tigger 0.5174 -
Piglet 0.0012 0.0012
P value adjustment method: fdr
### Note that the values in the table are p-values comparing each
### pair of groups.
### Convert PT to a full table and call it PT1
PT = PT$p.value ### Extract p-value table
library(rcompanion)
PT1 = fullPTable(PT)
PT1
Pooh Tigger Piglet
Pooh 1.000000000 0.517377650 0.001241095
Tigger 0.517377650 1.000000000 0.001241095
Piglet 0.001241095 0.001241095 1.000000000
### Produce compact letter display
library(multcompView)
multcompLetters(PT1,
compare="<",
threshold=0.05, # p-value to use as significance threshold
Letters=letters,
reversed = FALSE)
Pooh Tigger Piglet
"a" "a" "b"
### Values sharing a letter are not significantly different
Plot of medians and confidence intervals
The following code uses the groupwiseMedian function to produce a data frame of medians for each speaker along with the 95% confidence intervals for each median with the percentile method.
These medians are then plotted, with their confidence intervals shown as error bars. The grouping letters from the multiple comparisons (Dunn test or pairwise Mann–Whitney U-tests) are added.
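For readers curious about the percentile method, the following is a minimal base R sketch of a percentile bootstrap confidence interval for the median of a single group. It is not the implementation used by groupwiseMedian, and the seed and object names are arbitrary choices, so its results may differ slightly from those of the packaged function.

```r
set.seed(1234)   ### Arbitrary seed, so that the sketch is repeatable

### Scores for one speaker, copied from the example data
Pooh = c(3, 5, 4, 4, 4, 4, 4, 4, 5, 5)

### Number of bootstrap resamples
R = 5000

### Resample the scores with replacement and take the median each time
Boot.medians = replicate(R, median(sample(Pooh, replace = TRUE)))

### Percentile method: the middle 95% of the bootstrapped medians
CI = quantile(Boot.medians, probs = c(0.025, 0.975))

median(Pooh)
CI
```

With discrete Likert data the bootstrapped medians take only a few distinct values, which is why the interval limits tend to fall on whole or half scores.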
library(rcompanion)
Sum = groupwiseMedian(Likert ~ Speaker,
data = Data,
conf = 0.95,
R = 5000,
percentile = TRUE,
bca = FALSE,
digits = 3)
Sum
Speaker n Median Conf.level Percentile.lower Percentile.upper
1 Pooh 10 4 0.95 4.0 5.0
2 Piglet 10 2 0.95 2.0 3.0
3 Tigger 10 4 0.95 3.5 4.5
X = 1:3
Y = Sum$Percentile.upper + 0.2
Label = c("a", "b", "a") ### Letters must follow the row order of Sum
library(ggplot2)
ggplot(Sum, ### The data frame to use.
aes(x = Speaker,
y = Median)) +
geom_errorbar(aes(ymin = Percentile.lower,
ymax = Percentile.upper),
width = 0.05,
size = 0.5) +
geom_point(shape = 15,
size = 4) +
theme_bw() +
theme(axis.title = element_text(face = "bold")) +
ylab("Median Likert score") +
annotate("text",
x = X,
y = Y,
label = Label)
Plot of median Likert score versus Speaker. Error bars indicate the 95% confidence
intervals for the median with the percentile method.
References
Cohen, J. 1988. Statistical Power Analysis for the Behavioral Sciences, 2nd Edition. Routledge.
Vargha, A. and H.D. Delaney. 2000. A Critique and Improvement of the CL Common Language Effect Size Statistics of McGraw and Wong. Journal of Educational and Behavioral Statistics 25(2):101–132.
Exercises L
1. Considering Pooh, Piglet, and Tigger’s data,
a. What was the median score for each instructor?
b. Are the data for all instructors of reasonably
similar shape and spread?
c. Based on your previous answer, what is the null hypothesis
for the Kruskal–Wallis test?
d. According to the Kruskal–Wallis test, is there a statistical
difference in scores among the instructors?
e. What is the value of epsilon-squared for these data?
f. How do you interpret this effect size?
g. Looking at the post-hoc analysis, which speakers’ scores
are statistically different from which others? Who had the statistically
highest scores?
h. What do you conclude practically?
2. Brian, Stewie, and Meg want to assess the education level of students in
their courses on creative writing for adults. They want to know the median
education level for each class, and if the education level of the classes were
different among instructors.
They used the following table to code their data.
Code Abbreviation Level
1 < HS Less than high school
2 HS High school
3 BA Bachelor’s
4 MA Master’s
5 PhD Doctorate
The following are the course data.
Instructor Student Education
'Brian Griffin' a 3
'Brian Griffin' b 2
'Brian Griffin' c 3
'Brian Griffin' d 3
'Brian Griffin' e 3
'Brian Griffin' f 3
'Brian Griffin' g 4
'Brian Griffin' h 5
'Brian Griffin' i 3
'Brian Griffin' j 4
'Brian Griffin' k 3
'Brian Griffin' l 2
'Stewie Griffin' m 4
'Stewie Griffin' n 5
'Stewie Griffin' o 4
'Stewie Griffin' p 4
'Stewie Griffin' q 4
'Stewie Griffin' r 4
'Stewie Griffin' s 3
'Stewie Griffin' t 5
'Stewie Griffin' u 4
'Stewie Griffin' v 4
'Stewie Griffin' w 3
'Stewie Griffin' x 2
'Meg Griffin' y 3
'Meg Griffin' z 4
'Meg Griffin' aa 3
'Meg Griffin' ab 3
'Meg Griffin' ac 3
'Meg Griffin' ad 2
'Meg Griffin' ae 3
'Meg Griffin' af 4
'Meg Griffin' ag 2
'Meg Griffin' ah 3
'Meg Griffin' ai 2
'Meg Griffin' aj 1
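Because the instructor names contain spaces and are wrapped in single quotes, they can be read as single values by read.table. The sketch below shows one way this might be done, using only the first row for each instructor to keep it short. The names Data2 and Education.f are hypothetical, and the labels come from the coding table above.

```r
### A subset of the course data: the first row for each instructor
Input = ("
Instructor        Student  Education
'Brian Griffin'   a        3
'Stewie Griffin'  m        4
'Meg Griffin'     y        3
")

### Quoted names are read as single values
Data2 = read.table(text = Input, header = TRUE)

### Keep the instructors in the order in which they appear
Data2$Instructor = factor(Data2$Instructor,
                          levels = unique(Data2$Instructor))

### Education as an ordered factor, labeled with the abbreviations
Data2$Education.f = factor(Data2$Education,
                           levels  = 1:5,
                           labels  = c("< HS", "HS", "BA", "MA", "PhD"),
                           ordered = TRUE)

str(Data2)
```

Keeping both the numeric code and the labeled factor makes it easier to run the tests on the codes while reporting results as education levels.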
For each of the following, answer the question, and show the output from the analyses you used to answer the question.
a. What was the median education level for each instructor’s class? (Be sure to report the education level, not just the numeric code!)
b. Are the distributions of education levels for all instructors
reasonably similar in shape and spread?
c. Based on your previous answer, what is the null hypothesis
for the Kruskal–Wallis test?
d. According to the Kruskal–Wallis test, is there a difference
in the education level of students among the instructors?
e. What is the value of epsilon-squared for these data?
f. How do you interpret this effect size?
g. Looking at the post-hoc analysis, which classes’ education
levels are different from which others? Who had the statistically highest
education level?
h. Plot Brian, Stewie, and Meg’s data in a way that helps you
visualize the data. Do the results reflect what you would expect from looking
at the plot?
i. How would you summarize the results of the descriptive statistics and tests? What do you conclude practically?