 ## Summary and Analysis of Extension Program Evaluation in R

Salvatore S. Mangiafico

# Kruskal–Wallis Test

The Kruskal–Wallis test is a rank-based test that is similar to the Mann–Whitney U test, but can be applied to one-way data with more than two groups.

Without further assumptions about the distribution of the data, the Kruskal–Wallis test does not address hypotheses about the medians of the groups.  Instead, the test addresses if it is likely that an observation in one group is greater than an observation in the other.  This is sometimes stated as testing if one sample has stochastic dominance compared with the other.

The test assumes that the observations are independent.  That is, it is not appropriate for paired observations or repeated measures data.

It is performed with the kruskal.test function.

Appropriate effect size statistics include maximum Vargha and Delaney’s A, maximum Cliff’s delta, Freeman’s theta, and epsilon-squared.

##### Post-hoc tests

The outcome of the Kruskal–Wallis test tells you if there are differences among the groups, but doesn’t tell you which groups are different from other groups.  In order to determine which groups are different from others, post-hoc testing can be conducted.  Probably the most common post-hoc test for the Kruskal–Wallis test is the Dunn test, here conducted with the dunnTest function in the FSA package.

##### Appropriate data

•  One-way data

•  Dependent variable is ordinal, interval, or ratio

•  Independent variable is a factor with two or more levels.  That is, two or more groups

•  Observations between groups are independent.  That is, not paired or repeated measures data

•  In order to be a test of medians, the distributions of values for each group need to be of similar shape and spread.  Otherwise the test is typically a test of stochastic equality.

##### Hypotheses

•  Null hypothesis:  The groups are sampled from populations with identical distributions.  Typically, that the sampled populations exhibit stochastic equality.

•  Alternative hypothesis (two-sided): The groups are sampled from populations with different distributions.  Typically, that one sampled population exhibits stochastic dominance.

##### Interpretation

Significant results can be reported as “There was a significant difference in values among groups.”

Post-hoc analysis allows you to say “There was a significant difference in values between groups A and B.”, and so on.

##### Other notes and alternative tests

Mood’s median test compares the medians of groups.

### Packages used in this chapter

The packages used in this chapter include:

•  psych

•  FSA

•  lattice

•  multcompView

•  rcompanion

The following commands will install these packages if they are not already installed:

if(!require(psych)){install.packages("psych")}
if(!require(FSA)){install.packages("FSA")}
if(!require(lattice)){install.packages("lattice")}
if(!require(multcompView)){install.packages("multcompView")}
if(!require(rcompanion)){install.packages("rcompanion")}

### Kruskal–Wallis test example

This example re-visits the Pooh, Piglet, and Tigger data from the Descriptive Statistics with the likert Package chapter.

It answers the question, “Are the scores significantly different among the three speakers?”

The Kruskal–Wallis test is conducted with the kruskal.test function, which produces a p-value for the hypothesis.  First the data are summarized and examined using bar plots for each group.

Input =("
Speaker  Likert
Pooh      3
Pooh      5
Pooh      4
Pooh      4
Pooh      4
Pooh      4
Pooh      4
Pooh      4
Pooh      5
Pooh      5
Piglet    2
Piglet    4
Piglet    2
Piglet    2
Piglet    1
Piglet    2
Piglet    3
Piglet    2
Piglet    2
Piglet    3
Tigger    4
Tigger    4
Tigger    4
Tigger    4
Tigger    5
Tigger    3
Tigger    5
Tigger    4
Tigger    4
Tigger    3
")

### Order levels of the factor; otherwise R will alphabetize them

Data\$Speaker = factor(Data\$Speaker,
levels=unique(Data\$Speaker))

### Create a new variable which is the likert scores as an ordered factor

Data\$Likert.f = factor(Data\$Likert,
ordered = TRUE)

###  Check the data frame

library(psych)

str(Data)

summary(Data)

### Remove unnecessary objects

rm(Input)

#### Summarize data treating Likert scores as factors

xtabs( ~ Speaker + Likert.f,
data = Data)

Likert.f
Speaker  1 2 3 4 5
Pooh   0 0 1 6 3
Piglet 1 6 2 1 0
Tigger 0 0 2 6 2

XT = xtabs( ~ Speaker + Likert.f,
data = Data)

prop.table(XT,
margin = 1)

Likert.f
Speaker    1   2   3   4   5
Pooh   0.0 0.0 0.1 0.6 0.3
Piglet 0.1 0.6 0.2 0.1 0.0
Tigger 0.0 0.0 0.2 0.6 0.2

#### Bar plots of data by group

library(lattice)

histogram(~ Likert.f | Speaker,
data=Data,
layout=c(1,3)      #  columns and rows of individual plots
) #### Summarize data treating Likert scores as numeric

library(FSA)

Summarize(Likert ~ Speaker,
data=Data,
digits=3)

Speaker  n mean    sd min Q1 median   Q3 max percZero
1    Pooh 10  4.2 0.632   3  4      4 4.75   5        0
2  Piglet 10  2.3 0.823   1  2      2 2.75   4        0
3  Tigger 10  4.0 0.667   3  4      4 4.00   5        0

#### Kruskal–Wallis test example

This example uses the formula notation indicating that Likert is the dependent variable and Speaker is the independent variable.  The data= option indicates the data frame that contains the variables.  For the meaning of other options, see ?kruskal.test.

kruskal.test(Likert ~ Speaker,
data = Data)

Kruskal-Wallis rank sum test

Kruskal-Wallis chi-squared = 16.842, df = 2, p-value = 0.0002202

#### Effect size

Statistics of effect size for the Kruskal–Wallis test provide the degree to which one group has data with higher ranks than another group.  They are related to the probability that a value from one group will be greater than a value from another group.   Unlike p-values, they are not affected by sample size.

Appropriate effect size statistics for the Kruskal–Wallis test include Freeman’s theta and epsilon-squared.  epsilon-squared is probably the most common. For Freeman’s theta, an effect size of 1 indicates that the measurements for each group are entirely greater or entirely less than some other group, and an effect size of 0 indicates that there is no effect; that is, that the groups are absolutely stochastically equal.

Another option is to use the maximum Cliff’s delta or Vargha and Delaney’s A (VDA) from pairwise comparisons of all groups.  VDA is the probability that an observation from one group is greater than an observation from the other group.  Because of this interpretation, VDA is an effect size statistic that is relatively easy to understand.

Interpretation of effect sizes necessarily varies by discipline and the expectations of the experiment.  The following guidelines are based on my personal intuition or published values.  They should not be considered universal.

Technical note: The values for the interpretations for Freeman’s theta to epsilon-squared below were derived by keeping the interpretation for epsilon-squared constant and equal to that for the Mann–Whitney test.  Interpretation values for Freeman’s theta were determined through comparing Freeman’s theta to epsilon-squared for simulated data (5-point Likert items, n per group between 4 and 25).

Interpretations for Vargha and Delaney’s A and Cliff’s delta come from Vargha and Delaney (2000).

 small medium large epsilon-squared 0.01     – < 0.08 0.08     – < 0.26 ≥ 0.26 Freeman’s theta, k = 2 0.11     – < 0.34 0.34     – < 0.58 ≥ 0.58 Freeman’s theta, k = 3 0.05     – < 0.26 0.26     – < 0.46 ≥ 0.46 Freeman’s theta, k = 5 0.05     – < 0.21 0.21     – < 0.40 ≥ 0.40 Freeman’s theta, k = 7 0.05     – < 0.20 0.20     – < 0.38 ≥ 0.38 Freeman’s theta, k = 7 0.05     – < 0.20 0.20     – < 0.38 ≥ 0.38 Maximum Cliff’s delta 0.11     –   < 0.28 0.28     –   < 0.43 ≥ 0.43 Maximum Vargha and Delaney’s A 0.56     –   < 0.64 > 0.34  –   0.44 0.64     –   < 0.71 > 0.29  –   0.34 ≥ 0.71 ≤ 0.29

##### epsilon-squared

library(rcompanion)

epsilonSquared(x = Data\$Likert,
g = Data\$Speaker)

epsilon.squared
0.581

##### Freeman’s theta

library(rcompanion)

freemanTheta(x = Data\$Likert,
g = Data\$Speaker)

Freeman.theta
0.64

##### Maximum Vargha and Delaney’s A or Cliff’s delta

Here, the multiVDA function is used to calculate Vargha and Delaney’s A (VDA), Cliff’s delta (CD), and r between all pairs of groups.  The function identifies the comparison with the most extreme VDA statistic (0.95 for Pooh – Piglet).  That is, it identifies the most disparate groups.

source("http://rcompanion.org/r_script/multiVDA.r")

library(rcompanion)

library(coin)

multiVDA(x = Data\$Likert,
g = Data\$Speaker)

\$pairs
Comparison  VDA    CD      r VDA.m CD.m   r.m
1   Pooh - Piglet = 0 0.95  0.90  0.791  0.95 0.90 0.791
2   Pooh - Tigger = 0 0.58  0.16  0.154  0.58 0.16 0.154
3 Piglet - Tigger = 0 0.07 -0.86 -0.756  0.93 0.86 0.756

\$comparison
Comparison
"Pooh - Piglet = 0"

\$statistic
VDA
0.95

\$statistic.m
VDA.m
0.95

#### Post-hoc test: Dunn test for multiple comparisons of groups

If the Kruskal–Wallis test is significant, a post-hoc analysis can be performed to determine which groups differ from each other group.

Probably the most popular post-hoc test for the Kruskal–Wallis test is the Dunn test.  The Dunn test can be conducted with the dunnTest function in the FSA package.

Because the post-hoc test will produce multiple p-values, adjustments to the p-values can be made to avoid inflating the possibility of making a type-I error.  There are a variety of methods for controlling the familywise error rate or for controlling the false discovery rate.  See ?p.adjust for details on these methods.

When there are many p-values to evaluate, it is useful to condense a table of p-values to a compact letter display format.  In the output, groups are separated by letters.  Groups sharing the same letter are not significantly different.  Compact letter displays are a clear and succinct way to present results of multiple comparisons.

### Order groups by median

Data\$Speaker = factor(Data\$Speaker,
levels=c("Pooh", "Tigger", "Piglet"))

levels(Data\$Speaker)

### Dunn test

library(FSA)

DT = dunnTest(Likert ~ Speaker,
data=Data,
method="bh")      # Adjusts p-values for multiple comparisons;
# See ?dunnTest for options

DT

Dunn (1964) Kruskal-Wallis multiple comparison
p-values adjusted with the Benjamini-Hochberg method.

1   Pooh - Tigger 0.4813074 0.6302980448 0.6302980448
2   Pooh - Piglet 3.7702412 0.0001630898 0.0004892695
3 Tigger - Piglet 3.2889338 0.0010056766 0.0015085149

### Compact letter display

PT = DT\$res

PT

library(rcompanion)

data = PT,
threshold = 0.05)

Group Letter MonoLetter
1   Pooh      a         a
2 Tigger      a         a
3 Piglet      b          b

Groups sharing a letter not signficantly different (alpha = 0.05).

#### Post-hoc test: pairwise Mann–Whitney U-tests for multiple comparisons

I don’t recommend using pairwise Mann–Whitney U-tests for post-hoc testing for the Kruskal–Wallis test, but the following example shows how this can be done.

The pairwise.wilcox.test function produces a table of p-values comparing each pair of groups.

To prevent the inflation of type I error rates, adjustments to the p-values can be made using the p.adjust.method option.  Here the fdr method is used.  See ?p.adjust for details on available p-value adjustment methods.

When there are many p-values to evaluate, it is useful to condense a table of p-values to a compact letter display format.  This can be accomplished with a combination of the fullPTable function in the rcompanion package and the multcompLetters function in the multcompView package.

In a compact letter display, groups sharing the same letter are not significantly different.

Here the fdr p-value adjustment method is used.  See ?p.adjust for details on available methods.

The code creates a matrix of p-values called PT, then converts this to a fuller matrix called PT1PT1 is then passed to the multcompLetters function to be converted to a compact letter display.

Note that the p-value results of the pairwise Mann–Whitney U-tests differ somewhat from those of the Dunn test.

### Order groups by median

Data\$Speaker = factor(Data\$Speaker,
levels=c("Pooh", "Tigger", "Piglet"))

Data

### Pairwise Mann–Whitney

PT = pairwise.wilcox.test(Data\$Likert,
Data\$Speaker,
# Adjusts p-values for multiple comparisons;

PT

Pairwise comparisons using Wilcoxon rank sum test

Pooh   Tigger
Tigger 0.5174 -
Piglet 0.0012 0.0012

### Note that the values in the table are p-values comparing each
###   pair of groups.

### Convert PT to a full table and call it PT1

PT = PT\$p.value    ### Extract p-value table

library(rcompanion)

PT1 = fullPTable(PT)

PT1

Pooh      Tigger      Piglet
Pooh   1.000000000 0.517377650 0.001241095
Tigger 0.517377650 1.000000000 0.001241095
Piglet 0.001241095 0.001241095 1.000000000

### Produce compact letter display

library(multcompView)

multcompLetters(PT1,
compare="<",
threshold=0.05,  # p-value to use as significance threshold
Letters=letters,
reversed = FALSE)

Pooh Tigger Piglet
"a"    "a"    "b"

### Values sharing a letter are not significantly different

### Plot of medians and confidence intervals

The following code uses the groupwiseMedian function to produce a data frame of medians for each speaker along with the 95% confidence intervals for each median with the percentile method.  These medians are then plotted, with their confidence intervals shown as error bars.  The grouping letters from the multiple comparisons (Dunn test or pairwise Mann–Whitney U-tests) are added.

Note that bootstrapped confidence intervals may not be reliable for discreet data, such as the ordinal Likert data used in these examples, especially for small samples.

library(rcompanion)

Sum = groupwiseMedian(Likert ~ Speaker,
data       = Data,
conf       = 0.95,
R          = 5000,
percentile = TRUE,
bca        = FALSE,
digits     = 3)

Sum

Speaker  n Median Conf.level Percentile.lower Percentile.upper
1    Pooh 10      4       0.95              4.0              5.0
2  Piglet 10      2       0.95              2.0              3.0
3  Tigger 10      4       0.95              3.5              4.5

X     = 1:3
Y     = Sum\$Percentile.upper + 0.2
Label = c("a", "b", "a")

library(ggplot2)

ggplot(Sum,                ### The data frame to use.
aes(x = Speaker,
y = Median)) +
geom_errorbar(aes(ymin = Percentile.lower,
ymax = Percentile.upper),
width = 0.05,
size  = 0.5) +
geom_point(shape = 15,
size  = 4) +
theme_bw() +
theme(axis.title   = element_text(face  = "bold")) +

ylab("Median Likert score") +

annotate("text",
x = X,
y = Y,
label = Label) Plot of median Likert score versus Speaker.  Error bars indicate the 95% confidence intervals for the median with the percentile method.

### References

Cohen, J. 1988. Statistical Power Analysis for the Behavioral Sciences, 2nd Edition. Routledge.

Vargha, A. and H.D. Delaney. A Critique and Improvement of the CL Common Language Effect Size Statistics of McGraw and Wong. 2000. Journal of Educational and Behavioral Statistics 25(2):101–132.

### Exercises L

1. Considering Pooh, Piglet, and Tigger’s data,

a.  What was the median score for each instructor?

b.  According to the Kruskal–Wallis test, is there a statistical difference in scores among the instructors?

c.  What is the value of maximum Vargha and Delaney’s A for these data?

d.  How do you interpret this value? (What does it mean? And is the standard interpretation in terms of “small”, “medium”, or “large”?)

e.  Looking at the post-hoc analysis, which speakers’ scores are statistically different from which others?  Who had the statistically highest scores?

f.  How would you summarize the results of the descriptive statistics and tests?  Include practical considerations of any differences.

2. Brian, Stewie, and Meg want to assess the education level of students in their courses on creative writing for adults.  They want to know the median education level for each class, and if the education level of the classes were different among instructors.

They used the following table to code his data.

Code   Abbreviation   Level

1      < HS           Less than high school
2        HS           High school
3        BA           Bachelor’s
4        MA           Master’s
5        PhD          Doctorate

The following are the course data.

Instructor        Student  Education
'Brian Griffin'   a        3
'Brian Griffin'   b        2
'Brian Griffin'   c        3
'Brian Griffin'   d        3
'Brian Griffin'   e        3
'Brian Griffin'   f        3
'Brian Griffin'   g        4
'Brian Griffin'   h        5
'Brian Griffin'   i        3
'Brian Griffin'   j        4
'Brian Griffin'   k        3
'Brian Griffin'   l        2
'Stewie Griffin'  m        4
'Stewie Griffin'  n        5
'Stewie Griffin'  o        4
'Stewie Griffin'  p        4
'Stewie Griffin'  q        4
'Stewie Griffin'  r        4
'Stewie Griffin'  s        3
'Stewie Griffin'  t        5
'Stewie Griffin'  u        4
'Stewie Griffin'  v        4
'Stewie Griffin'  w        3
'Stewie Griffin'  x        2
'Meg Griffin'     y        3
'Meg Griffin'     z        4
'Meg Griffin'     aa       3
'Meg Griffin'     ab       3
'Meg Griffin'     ac       3
'Meg Griffin'     ae       3
'Meg Griffin'     af       4
'Meg Griffin'     ag       2
'Meg Griffin'     ah       3
'Meg Griffin'     ai       2
'Meg Griffin'     aj       1

For each of the following, answer the question, and show the output from the analyses you used to answer the question.

a.  What was the median education level for each instructor’s class?  (Be sure to report the education level, not just the numeric code!)

b.  According to the Kruskal–Wallis test, is there a difference in the education level of students among the instructors?

c.  What is the value of maximum Vargha and Delaney’s A for these data?

d.  How do you interpret this value? (What does it mean? And is the standard interpretation in terms of “small”, “medium”, or “large”?)

e.  Looking at the post-hoc analysis, which classes education levels are statistically different from which others?  Who had the statistically highest education level?

f.  Plot Brian, Stewie, and Meg’s data in a way that helps you visualize the data.  Do the results reflect what you would expect from looking at the plot?

g.  How would you summarize the results of the descriptive statistics and tests?  What do you conclude practically?