When to use this test
The two-sample Mann–Whitney U test compares values for two groups. A significant result suggests that the values for the two groups are different. It is equivalent to a two-sample Wilcoxon rank-sum test.
In the context of this book, the test is useful to compare the scores or ratings from two speakers, two different presentations, or two groups of audiences.
If the shape and spread of the distributions of values for each group are similar, then the test compares the medians of the two groups. Otherwise, the test is really testing whether there is a systematic difference in the values of the two groups. This is sometimes stated as testing whether one sample has stochastic dominance over the other.
The test assumes that the observations are independent. That is, it is not appropriate for paired observations or repeated measures data.
The test is performed with the wilcox.test function.
Appropriate data
• Two-sample data. That is, one-way data with two groups only
• Dependent variable is ordinal, interval, or ratio
• Independent variable is a factor with two levels. That is, two groups
• Observations between groups are independent. That is, not paired or repeated measures data
• In order to be a test of medians, the distributions of values for each group need to be of similar shape and spread; outliers affect the spread. Otherwise the test is a test of stochastic equality.
Hypotheses
If the distributions of the two groups are similar in shape and spread:
• Null hypothesis: The medians of values for each group are equal.
• Alternative hypothesis (two-sided): The medians of values for each group are not equal.
If the distributions of the two groups are not similar in shape and spread:
• Null hypothesis: The two groups exhibit stochastic equality.
• Alternative hypothesis (two-sided): The two groups do not exhibit stochastic equality.
Interpretation
If the distributions of the two groups are similar in shape and spread:
Significant results can be reported as e.g. “The median value of group A was significantly different from that of group B.”
If the distributions of the two groups are not similar in shape and spread:
Significant results can be reported as e.g. “Values for group A were significantly different from those for group B.”
Other notes and alternative tests
The Mann–Whitney U test can be considered equivalent to the Kruskal–Wallis test with only two groups.
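As a rough check of this equivalence, the following sketch runs both tests on the Pooh and Piglet scores used in the example later in this chapter. It assumes the normal approximation without the continuity correction for wilcox.test (exact=FALSE, correct=FALSE), because kruskal.test applies no continuity correction. The object names pooh, piglet, p.mw, and p.kw are illustrative only.

```r
### Scores copied from the Pooh and Piglet example in this chapter
pooh   = c(3, 5, 4, 4, 4, 4, 4, 4, 5, 5)
piglet = c(2, 4, 2, 2, 1, 2, 3, 2, 2, 3)

### Mann-Whitney test, normal approximation, no continuity correction
p.mw = wilcox.test(piglet, pooh,
                   exact   = FALSE,
                   correct = FALSE)$p.value

### Kruskal-Wallis test on the same two groups
p.kw = kruskal.test(list(piglet, pooh))$p.value

### The two p-values should agree up to numerical tolerance
print(c(MannWhitney = p.mw, KruskalWallis = p.kw))
print(all.equal(p.mw, p.kw))
```

With the default settings of wilcox.test (continuity correction applied), the two p-values will be close but not identical.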
Mood’s median test compares the medians of two groups. It is described in the next chapter.
Another alternative is to use cumulative link models for ordinal data, which are described later in this book.
Packages used in this chapter
The packages used in this chapter include:
• psych
• FSA
• lattice
• rcompanion
• coin
• DescTools
• effsize
The following commands will install these packages if they are not already installed:
if(!require(psych)){install.packages("psych")}
if(!require(FSA)){install.packages("FSA")}
if(!require(lattice)){install.packages("lattice")}
if(!require(rcompanion)){install.packages("rcompanion")}
if(!require(coin)){install.packages("coin")}
if(!require(DescTools)){install.packages("DescTools")}
if(!require(effsize)){install.packages("effsize")}
Two-sample Mann–Whitney U test example
This example revisits the Pooh and Piglet data from the Descriptive Statistics with the likert Package chapter.
It answers the question, “Are Pooh's scores significantly different from those of Piglet?”
The Mann–Whitney U test is conducted with the wilcox.test function, which produces a p-value for the hypothesis. First the data are summarized and examined using bar plots for each group.
Because the bar plots show that the distributions of scores for Pooh and Piglet are relatively similar in shape, the Mann–Whitney U test can be interpreted as a test of medians.
Input =("
Speaker Likert
Pooh 3
Pooh 5
Pooh 4
Pooh 4
Pooh 4
Pooh 4
Pooh 4
Pooh 4
Pooh 5
Pooh 5
Piglet 2
Piglet 4
Piglet 2
Piglet 2
Piglet 1
Piglet 2
Piglet 3
Piglet 2
Piglet 2
Piglet 3
")
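### stringsAsFactors=TRUE makes Speaker a factor in R 4.0 and later,
### which as.numeric(Data$Speaker) relies on further below
Data = read.table(textConnection(Input),header=TRUE,
                  stringsAsFactors=TRUE)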
### Create a new variable which is the Likert scores as an ordered factor
Data$Likert.f = factor(Data$Likert,
ordered = TRUE)
### Check the data frame
library(psych)
headTail(Data)
str(Data)
summary(Data)
### Remove unnecessary objects
rm(Input)
Summarize data treating Likert scores as factors
Note that the variable we want to count is Likert.f, which is a factor variable. Counts for Likert.f are cross tabulated over values of Speaker. The prop.table function translates a table into proportions. The margin=1 option indicates that the proportions are calculated for each row.
xtabs( ~ Speaker + Likert.f,
data = Data)
Likert.f
Speaker 1 2 3 4 5
Piglet 1 6 2 1 0
Pooh 0 0 1 6 3
XT = xtabs( ~ Speaker + Likert.f,
data = Data)
prop.table(XT,
margin = 1)
Likert.f
Speaker 1 2 3 4 5
Piglet 0.1 0.6 0.2 0.1 0.0
Pooh 0.0 0.0 0.1 0.6 0.3
Bar plots of data by group
library(lattice)
histogram(~ Likert.f | Speaker,
          data=Data,
          layout=c(1,2)      # columns and rows of individual plots
          )
Summarize data treating Likert scores as numeric
library(FSA)
Summarize(Likert ~ Speaker,
data=Data,
digits=3)
Speaker n mean sd min Q1 median Q3 max percZero
1 Piglet 10 2.3 0.823 1 2 2 2.75 4 0
2 Pooh 10 4.2 0.632 3 4 4 4.75 5 0
Two-sample Mann–Whitney U test example
This example uses the formula notation indicating that Likert is the dependent variable and Speaker is the independent variable. The data= option indicates the data frame that contains the variables. For the meaning of other options, see ?wilcox.test.
wilcox.test(Likert ~ Speaker,
data=Data)
Wilcoxon rank sum test with continuity correction
W = 5, p-value = 0.0004713
alternative hypothesis: true location shift is not equal to 0
### You may get a "cannot compute exact p-value with ties" warning.
### You can ignore this or use the exact=FALSE option.
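The following sketch shows one way to avoid the warning and to pull the test statistic and p-value out of the result for reporting. It rebuilds the example data under the illustrative names Scores and Result, which are not part of the original example.

```r
### Rebuild the Pooh and Piglet scores in a small data frame
Scores = data.frame(
   Speaker = factor(rep(c("Pooh", "Piglet"), each = 10)),
   Likert  = c(3, 5, 4, 4, 4, 4, 4, 4, 5, 5,     # Pooh
               2, 4, 2, 2, 1, 2, 3, 2, 2, 3))    # Piglet

### exact=FALSE requests the normal approximation,
### so no warning about ties is issued
Result = wilcox.test(Likert ~ Speaker,
                     data  = Scores,
                     exact = FALSE)

### The result is a list; its components can be used directly
W = unname(Result$statistic)
p = Result$p.value

print(W)
print(signif(p, 4))
```

Because the factor levels are ordered alphabetically, W here counts comparisons in which Piglet scored higher than Pooh, with ties counted as one half.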
Effect size
Statistics of effect size for the Mann–Whitney test describe the degree to which one group has data with higher ranks than the other group. They are related to the probability that a value from one group will be greater than a value from the other group. Unlike p-values, they are not affected by sample size. They are standardized to range from 0 to 1. An effect size of 0 indicates that there is no effect; that is, that the two groups are absolutely stochastically equal. An effect size of 1 indicates that the measurements for one group are entirely greater than those for the other group.
Probably the most common effect size statistic for the Mann–Whitney test is r, which is the Z value from the test divided by the square root of the total number of observations. As written here, r varies from 0 to 1. In some formulations, it varies from –1 to 1.
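To make the calculation concrete, here is a minimal sketch that computes r by hand for the Pooh and Piglet scores, assuming r = |Z| / sqrt(N) with Z taken from the tie-corrected normal approximation without a continuity correction. This is the formula that reproduces the wilcoxonR value of 0.791 reported in this chapter; it is not a substitute for the package function, and all object names are illustrative.

```r
### Scores copied from the Pooh and Piglet example in this chapter
pooh   = c(3, 5, 4, 4, 4, 4, 4, 4, 5, 5)
piglet = c(2, 4, 2, 2, 1, 2, 3, 2, 2, 3)

x  = c(piglet, pooh)
n1 = length(piglet)
n2 = length(pooh)
N  = n1 + n2

### Rank-sum statistic for the first group (mid-ranks for ties)
Ranks = rank(x)
W     = sum(Ranks[1:n1]) - n1 * (n1 + 1) / 2

### Standard deviation of W under the null, corrected for ties
Ties  = table(x)
Sigma = sqrt(n1 * n2 / 12 *
             ((N + 1) - sum(Ties^3 - Ties) / (N * (N - 1))))

### Z statistic and the effect size r
Z = (W - n1 * n2 / 2) / Sigma
r = abs(Z) / sqrt(N)

print(round(r, 3))
```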
Cliff’s delta is another effect size statistic that is sometimes encountered. It ranges from –1 to 1, with 0 indicating stochastic equality of the two groups. Its absolute value will be numerically equal to Freeman’s theta, and it is linearly related to Vargha and Delaney’s A.
Kendall’s tau-b is sometimes used, and varies from approximately –1 to 1.
Freeman’s theta and epsilon-squared are usually used when there are more than two groups, with the Kruskal–Wallis test, but can be employed in the case of two groups.
Vargha and Delaney’s A is linearly related to Cliff’s delta, and expresses the probability that a value from one group will be greater than a value from the other group. A value of 0.50 indicates that the two groups are stochastically equal. A value of 1 indicates that one group shows complete stochastic domination over the other group, and a value of 0 indicates the complete stochastic domination of the other group.
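Both statistics can be computed directly from all pairwise comparisons between the two groups, which makes the linear relationship between them easy to see. The sketch below does this for the Pooh and Piglet scores with Piglet as the first group; the sign of delta, and whether A falls above or below 0.50, depends on which group is listed first. Object names are illustrative.

```r
### Scores copied from the Pooh and Piglet example in this chapter
pooh   = c(3, 5, 4, 4, 4, 4, 4, 4, 5, 5)
piglet = c(2, 4, 2, 2, 1, 2, 3, 2, 2, 3)

### Compare every Piglet score with every Pooh score
Greater = sum(outer(piglet, pooh, ">"))    # Piglet score is higher
Less    = sum(outer(piglet, pooh, "<"))    # Piglet score is lower
Equal   = sum(outer(piglet, pooh, "=="))   # tied scores
Pairs   = length(piglet) * length(pooh)

### Cliff's delta: P(first > second) - P(first < second)
delta = (Greater - Less) / Pairs

### Vargha and Delaney's A: P(first > second) + 0.5 * P(tie)
A = (Greater + 0.5 * Equal) / Pairs

print(c(delta = delta, A = A))

### Linear relationship between the two statistics
print(all.equal(A, (delta + 1) / 2))
```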
Interpretation of effect sizes necessarily varies by discipline and the expectations of the experiment, but for behavioral studies, the guidelines proposed by Cohen (1988) are sometimes followed. The following guidelines are based on the literature values and my personal intuition. They should not be considered universal.
Technical note: The interpretation values for r below are found commonly in published literature and on the internet. I suspect that this interpretation stems from the adoption of Cohen’s interpretation of values for Pearson’s r. This may not be justified, but it turns out that this interpretation for the r used here is relatively reasonable. The interpretation for tau-b, Freeman’s theta, and epsilon-squared here are based on their values relative to those for r, based on simulated data (5-point Likert items, n per group between 4 and 25). Plots for some of these simulations are shown below.
Interpretation for Vargha and Delaney’s A comes from Vargha and Delaney (2000). The values for Cliff’s delta are commonly cited on the internet, but I don’t know their origin.

                          small              medium             large
r                         0.10 – < 0.30      0.30 – < 0.50      ≥ 0.50
tau-b                     0.10 – < 0.30      0.30 – < 0.50      ≥ 0.50
Cliff’s delta             0.147 – < 0.330    0.330 – < 0.474    ≥ 0.474
Vargha and Delaney’s A    0.56 – < 0.64      0.64 – < 0.71      ≥ 0.71
                          or > 0.34 – 0.44   or > 0.29 – 0.34   or ≤ 0.29
Freeman’s theta           0.11 – < 0.34      0.34 – < 0.58      ≥ 0.58
epsilon-squared           0.01 – < 0.08      0.08 – < 0.26      ≥ 0.26
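For convenience, the cutoffs for r can be wrapped in a small helper. The function below is a hypothetical sketch, not part of any package used in this chapter; it applies the guideline ranges for r listed above to the absolute value of r, and it inherits the caveat that these guidelines are not universal.

```r
### Hypothetical helper: label values of r using the guideline
### ranges from the table in this chapter
interpret.r = function(r) {
   cut(abs(r),
       breaks = c(0, 0.10, 0.30, 0.50, Inf),
       labels = c("less than small", "small", "medium", "large"),
       right  = FALSE)          # intervals are closed on the left
}

### Example use, including the value reported for Pooh and Piglet
Labels = as.character(interpret.r(c(0.05, 0.10, 0.30, 0.791)))

print(Labels)
```

The same pattern could be adapted to the other statistics by changing the breaks argument.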
r
library(rcompanion)
wilcoxonR(x = Data$Likert,
g = Data$Speaker)
r
0.791
tau-b
library(DescTools)
KendallTauB(x = Data$Likert,
y = as.numeric(Data$Speaker))
[1] 0.7397954
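If DescTools is not available, the same value can be obtained in base R, since cor with method="kendall" handles ties in the manner of tau-b. The sketch below rebuilds the scores under illustrative names and codes the groups as 1 for Piglet and 2 for Pooh, matching the alphabetical factor coding used above.

```r
### Scores copied from the Pooh and Piglet example in this chapter
pooh   = c(3, 5, 4, 4, 4, 4, 4, 4, 5, 5)
piglet = c(2, 4, 2, 2, 1, 2, 3, 2, 2, 3)

Likert = c(piglet, pooh)
Group  = rep(c(1, 2), each = 10)    # 1 = Piglet, 2 = Pooh

### Kendall correlation between score and group membership
tau.b = cor(Likert, Group, method = "kendall")

print(tau.b)
```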
Cliff’s delta
library(effsize)
cliff.delta(d = Data$Likert,
f = Data$Speaker)
Cliff's Delta
delta estimate: -0.9 (large)
Vargha and Delaney’s A
library(effsize)
VD.A(d = Data$Likert,
f = Data$Speaker)
Vargha and Delaney A
A estimate: 0.05 (large)
Freeman’s theta
library(rcompanion)
freemanTheta(x = Data$Likert,
g = Data$Speaker)
Freeman.theta
0.9
epsilon-squared
library(rcompanion)
epsilonSquared(x = Data$Likert,
g = Data$Speaker)
epsilon.squared
0.658
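As a cross-check, this value can be reproduced from the Kruskal–Wallis statistic. The sketch below assumes that epsilon-squared is calculated as H / (N – 1), where H is the Kruskal–Wallis chi-squared statistic and N is the total number of observations; that formula matches the result above but should be confirmed against the epsilonSquared documentation. Object names are illustrative.

```r
### Scores copied from the Pooh and Piglet example in this chapter
pooh   = c(3, 5, 4, 4, 4, 4, 4, 4, 5, 5)
piglet = c(2, 4, 2, 2, 1, 2, 3, 2, 2, 3)

N = length(pooh) + length(piglet)

### Kruskal-Wallis chi-squared statistic for the two groups
H = unname(kruskal.test(list(piglet, pooh))$statistic)

### Assumed formula: epsilon-squared = H / (N - 1)
eps.sq = H / (N - 1)

print(round(eps.sq, 3))
```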
Optional: Comparison among effect size statistics
The following plots show the relationships among the effect size statistics discussed in this chapter. Data were 5-point Likert item responses, with n per group between 4 and 25.
Freeman’s theta was mostly linearly related to r, with variation depending on sample size and data values. In the second figure below, the colors indicate interpretation of less-than-small, small, medium, and large as the blue becomes darker.
The relationship of epsilon-squared and Freeman’s theta was curvilinear, with variation depending on sample size and data values. In the second figure below, the colors indicate interpretation of less-than-small, small, medium, and large as the blue becomes darker.
Kendall’s tau-b was relatively closely linearly related to r, up to a value of about 0.88. In the second figure below, the colors indicate interpretation of less-than-small, small, medium, and large as the blue becomes darker.
References
Cohen, J. 1988. Statistical Power Analysis for the Behavioral Sciences, 2nd Edition. Routledge.
Vargha, A. and H.D. Delaney. 2000. A Critique and Improvement of the CL Common Language Effect Size Statistics of McGraw and Wong. Journal of Educational and Behavioral Statistics 25(2):101–132.
Exercises J
1. Considering Pooh and Piglet’s data,
a. What was the median score for each instructor?
b. What were the first and third quartiles for each instructor’s scores?
c. Are the data for both instructors reasonably similar in shape and spread?
d. Based on your previous answer, what is the null hypothesis for the Mann–Whitney test?
e. According to the Mann–Whitney test, is there a difference in scores between the instructors?
f. What was the value of r for the effect size for these data?
g. How do you interpret this value?
h. How would you summarize the results of the descriptive statistics and tests? Include practical considerations of any differences.
2. Brian and Stewie Griffin want to assess the education level of students in their courses on creative writing for adults. They want to know the median education level for each class, and whether the education level differed between the instructors’ classes.
They used the following table to code their data.
Code Abbreviation Level
1 < HS Less than high school
2 HS High school
3 BA Bachelor’s
4 MA Master’s
5 PhD Doctorate
The following are the course data.
Instructor Student Education
'Brian Griffin' a 3
'Brian Griffin' b 2
'Brian Griffin' c 3
'Brian Griffin' d 3
'Brian Griffin' e 3
'Brian Griffin' f 3
'Brian Griffin' g 4
'Brian Griffin' h 5
'Brian Griffin' i 3
'Brian Griffin' j 4
'Brian Griffin' k 3
'Brian Griffin' l 2
'Stewie Griffin' m 4
'Stewie Griffin' n 5
'Stewie Griffin' o 4
'Stewie Griffin' p 4
'Stewie Griffin' q 4
'Stewie Griffin' r 4
'Stewie Griffin' s 3
'Stewie Griffin' t 5
'Stewie Griffin' u 4
'Stewie Griffin' v 4
'Stewie Griffin' w 3
'Stewie Griffin' x 2
For each of the following, answer the question, and show the output from the analyses you used to answer the question.
a. What was the median education level for each instructor's class? (Be sure to report the education level, not just the numeric code!)
b. What were the first and third quartiles for education level for each instructor's class?
c. Are the data for both instructors similar in shape and spread?
d. Based on your previous answer, what is the null hypothesis for the Mann–Whitney test?
e. According to the Mann–Whitney test, is there a difference in education level between the instructors' classes?
f. What was the value of r for the effect size for these data?
g. How do you interpret this value?
h. Plot Brian and Stewie’s data in a way that helps you visualize the data. Do the results reflect what you would expect from looking at the plot?
i. How would you summarize the results of the descriptive statistics and tests? Include your practical interpretation.