Summary and Analysis of Extension Program Evaluation in R

Salvatore S. Mangiafico

One-way ANOVA

A one-way analysis of variance (ANOVA) is similar to a t-test, except that it is capable of comparing more than two groups.


We will conduct the ANOVA by constructing a general linear model with the lm function in the native stats package.  The general linear model is the basis for more advanced parametric models that can include multiple independent variables that can be continuous or factor variables.


Appropriate data

•  One-way data.  That is, one measurement variable in two or more groups

•  Dependent variable is interval/ratio, and is continuous

•  Independent variable is a factor with two or more levels.  That is, two or more groups

•  In the general linear model approach, residuals are normally distributed

•  In the general linear model approach, groups have the same variance.  That is, homoscedasticity

•  Observations among groups are independent.  That is, not paired or repeated measures data

•  Moderate deviation from normally-distributed residuals is permissible



•  Null hypothesis:  The means of the measurement variable for each group are equal.

•  Alternative hypothesis (two-sided): The means of the measurement variable for among groups are not equal.



•  Reporting significant results for the omnibus test as “Significant differences were found among means for groups.” is acceptable.  Alternatively, “A significant effect for Independent Variable on Dependent Variable was found.”

•  Reporting significant results for mean separation post-hoc tests as “Mean of variable Y for group A was different than that for group B.” is acceptable.


Other notes and alternative tests

•  The nonparametric analogue for this test is the Kruskal–Wallis test.  Another nonparametric approach is to use a permutation test.  See Mangiafico (2015b) in the “References” section.

•  Power analysis for the two-sample t-test can be found at Mangiafico (2015a) in the “References” section.


Packages used in this chapter


The packages used in this chapter include:

•  psych

•  FSA

•  Rmisc

•  ggplot2

•  car

•  multcompView

•  lsmeans

•  rcompanion


The following commands will install these packages if they are not already installed:


One-way ANOVA example


In the following example, Brendon Small, Coach McGuirk, and Melissa Robbins have their SNAP-Ed students keep diaries of what they eat for a week, and then calculate the daily sodium intake in milligrams.  Since the classes have received different nutrition education programs, they want to see if the mean sodium intake is the same among classes.


The analysis will be conducted by constructing a general linear model with the lm function in the native stats package, and conducting the analysis of variance with the Anova function in the car package.


Plots will be used to check model assumptions for normality of residuals and homoscedasticity.


Finally, if the effect of Instructor is significant, a mean comparison test will be conducted to determine which means differ from which others.

Input = ("
Instructor       Student  Sodium
'Brendon Small'      a    1200
'Brendon Small'      b    1400
'Brendon Small'      c    1350
'Brendon Small'      d     950
'Brendon Small'      e    1400
'Brendon Small'      f    1150
'Brendon Small'      g    1300
'Brendon Small'      h    1325
'Brendon Small'      i    1425
'Brendon Small'      j    1500
'Brendon Small'      k    1250
'Brendon Small'      l    1150
'Brendon Small'      m     950
'Brendon Small'      n    1150
'Brendon Small'      o    1600
'Brendon Small'      p    1300
'Brendon Small'      q    1050
'Brendon Small'      r    1300
'Brendon Small'      s    1700
'Brendon Small'      t    1300
'Coach McGuirk'      u    1100
'Coach McGuirk'      v    1200
'Coach McGuirk'      w    1250
'Coach McGuirk'      x    1050
'Coach McGuirk'      y    1200
'Coach McGuirk'      z    1250
'Coach McGuirk'      aa   1350
'Coach McGuirk'      ab   1350
'Coach McGuirk'      ac   1325
'Coach McGuirk'      ad   1525
'Coach McGuirk'      ae   1225
'Coach McGuirk'      af   1125
'Coach McGuirk'      ag   1000
'Coach McGuirk'      ah   1125
'Coach McGuirk'      ai   1400
'Coach McGuirk'      aj   1200
'Coach McGuirk'      ak   1150
'Coach McGuirk'      al   1400
'Coach McGuirk'      am   1500
'Coach McGuirk'      an   1200
'Melissa Robins'     ao   900
'Melissa Robins'     ap   1100
'Melissa Robins'     aq   1150
'Melissa Robins'     ar   950
'Melissa Robins'     as   1100
'Melissa Robins'     at   1150
'Melissa Robins'     au   1250
'Melissa Robins'     av   1250
'Melissa Robins'     aw   1225
'Melissa Robins'     ax   1325
'Melissa Robins'     ay   1125
'Melissa Robins'     az   1025
'Melissa Robins'     ba    950
'Melissa Robins'     bc    925
'Melissa Robins'     bd   1200
'Melissa Robins'     be   1100
'Melissa Robins'     bf    950
'Melissa Robins'     bg   1300
'Melissa Robins'     bh   1400
'Melissa Robins'     bi   1100

Data = read.table(textConnection(Input),header=TRUE)

###  Order factors by the order in data frame
###  Otherwise, R will alphabetize them

Data$Instructor = factor(Data$Instructor,

###  Check the data frame





### Remove unnecessary objects


Summarize data by group


Summarize(Sodium ~ Instructor,

         Instructor  n nvalid    mean      sd  min   Q1 median   Q3  max percZero
1     Brendon Small 20     20 1287.50 193.734  950 1150   1300 1400 1700        0
2     Coach McGuirk 20     20 1246.25 142.412 1000 1144   1212 1350 1525        0
3 Melissa Robins    20     20 1123.75 143.149  900 1006   1112 1231 1400        0

Box plots for data by group

boxplot(Sodium ~ Instructor,
        data = Data)


Plot of means and confidence intervals

For this plot, a data frame called Sum will be created with the mean sodium intake and confidence interval for this mean for each instructor.


Sum = groupwiseMean(Sodium ~ Instructor,
                    data   = Data,
                    conf   = 0.95,
                    digits = 3,
                    traditional = FALSE,
                    percentile  = TRUE)


      Instructor  n Mean Conf.level Percentile.lower Percentile.upper

1  Brendon Small 20 1290       0.95             1200             1370

2  Coach McGuirk 20 1250       0.95             1190             1310

3 Melissa Robins 20 1120       0.95             1060             1180


ggplot(Sum,                ### The data frame to use.
       aes(x = Instructor,
           y = Mean)) +
   geom_errorbar(aes(ymin = Percentile.lower,
                     ymax = Percentile.upper),
                     width = 0.05,
                     size  = 0.5) +
   geom_point(shape = 15,
              size  = 4) +
   theme_bw() +
   theme(axis.title   = element_text(face  = "bold")) +
   ylab("Mean sodium, mg")


Define linear model

model = lm(Sodium ~ Instructor,
           data = Data)

summary(model)   ### Will show overall p-value and r-square

Multiple R-squared:  0.1632,  Adjusted R-squared:  0.1338

F-statistic: 5.558 on 2 and 57 DF,  p-value: 0.006235

Conduct analysis of variance


      type = "II")   ### Type II sum of squares

Anova Table (Type II tests)

Response: Sodium

            Sum Sq Df F value   Pr(>F)  
Instructor  290146  2  5.5579 0.006235 **
Residuals  1487812 57

Histogram of residuals

x = residuals(model)






plot(model)   ### Additional model checking plots

Post-hoc analysis:  mean separation tests

For an explanation of using least square means for multiple comparisons, review the following chapters:

•  What are Least Square Means?

•  Least Square Means for Multiple Comparisons


The code defines an object leastsquare which will hold the output of the lsmeans function.  The lsmeans function here calls for all pairwise comparisons for the independent variable Instructor in the previously defined model object.


There are two ways to view the lsmeans output.  Calling the leastsquare object to be printed produces two sections of output:  $lsmeans, which shows estimates for the LS means along with their standard errors and confidence intervals;  and $contrasts, which indicate the pairwise comparisons with p-values for the compared LS means being different.


Using the cld function produces a compact letter display.  LS means sharing a letter are not significantly different from one another at the alpha level indicated.  A Tukey adjustment is made for multiple comparisons with the adjust="tukey" option.  And Letters indicates the symbols to use for groups. 


To adjust the confidence intervals for the LS means, use e.g. summary(leastsquare, level=0.99).


Note that because Instructor is the only independent variable in the model, the LS means are equal to the arithmetic means produced by the Summarize function in this case. 



leastsquare = lsmeans(model,
                      pairwise ~ Instructor,
                      adjust = "tukey")


 Instructor         lsmean       SE df lower.CL upper.CL
 Brendon Small     1287.50 36.12615 57 1215.159 1359.841
 Coach McGuirk     1246.25 36.12615 57 1173.909 1318.591
 Melissa Robins    1123.75 36.12615 57 1051.409 1196.091

Confidence level used: 0.95

 contrast                          estimate       SE df t.ratio p.value
 Brendon Small - Coach McGuirk        41.25 51.09009 57   0.807  0.7000
 Brendon Small - Melissa Robins      163.75 51.09009 57   3.205  0.0062
 Coach McGuirk - Melissa Robins      122.50 51.09009 57   2.398  0.0510

P value adjustment: tukey method for comparing a family of 3 estimates

    alpha   = 0.05,
    Letters = letters,    ### Use lower-case letters for .group
    adjust="tukey")       ### Tukey-adjusted comparisons

Instructor         lsmean       SE df lower.CL upper.CL .group
 Melissa Robins    1123.75 36.12615 57 1034.882 1212.618  a   
 Coach McGuirk     1246.25 36.12615 57 1157.382 1335.118  ab  
 Brendon Small     1287.50 36.12615 57 1198.632 1376.368   b  

Confidence level used: 0.95
P value adjustment: tukey method for comparing a family of 3 estimates
significance level used: alpha = 0.05



“One-way Anova” in Mangiafico, S.S. 2015a. An R Companion for the Handbook of Biological Statistics, version 1.09. rcompanion.org/rcompanion/d_05.html.


“One-way Analysis with Permutation Test” in Mangiafico, S.S. 2015b. An R Companion for the Handbook of Biological Statistics, version 1.09. rcompanion.org/rcompanion/d_06a.html.


Exercises R

1. Considering Brendon, Melissa, and McGuirk’s data,

What was the mean sodium intake for each instructor?

Does Instructor have a significant effect on sodium intake?

Are the residuals reasonably normal and homoscedastic?

Which instructors had classes with significantly different mean sodium intake from which others?

Does this result correspond to what you would have thought from looking at the plot of the means and standard errors or the box plots?

How does the mean separation change if you change the alpha level to 0.06?

2.  As part of a professional skills program, a 4-H club tests its members for typing proficiency.  Dr. Katz, Laura, and Ben want to compare their students’ mean typing speed between their classes.

Instructor                             Student  Words.per.minute
'Dr. Katz Professional Therapist'      a        35
'Dr. Katz Professional Therapist'      b        50
'Dr. Katz Professional Therapist'      c        55
'Dr. Katz Professional Therapist'      d        60
'Dr. Katz Professional Therapist'      e        65
'Dr. Katz Professional Therapist'      f        60
'Dr. Katz Professional Therapist'      g        70
'Dr. Katz Professional Therapist'      h        55
'Dr. Katz Professional Therapist'      i        45
'Dr. Katz Professional Therapist'      j        55
'Dr. Katz Professional Therapist'      k        60
'Dr. Katz Professional Therapist'      l        45
'Dr. Katz Professional Therapist'      m        65
'Dr. Katz Professional Therapist'      n        55
'Dr. Katz Professional Therapist'      o        50
'Dr. Katz Professional Therapist'      p        60
'Laura the Receptionist'               q        55
'Laura the Receptionist'               r        60
'Laura the Receptionist'               s        75
'Laura the Receptionist'               t        65
'Laura the Receptionist'               u        60
'Laura the Receptionist'               v        70
'Laura the Receptionist'               w        75
'Laura the Receptionist'               x        70
'Laura the Receptionist'               y        65
'Laura the Receptionist'               z        72
'Laura the Receptionist'               aa       73
'Laura the Receptionist'               ab       65
'Laura the Receptionist'               ac       80
'Laura the Receptionist'               ad       50
'Laura the Receptionist'               ae       60
'Laura the Receptionist'               af       70
'Ben Katz'                             ag       55
'Ben Katz'                             ah       55
'Ben Katz'                             ai       70
'Ben Katz'                             aj       55
'Ben Katz'                             ak       65
'Ben Katz'                             al       60
'Ben Katz'                             am       70
'Ben Katz'                             an       60
'Ben Katz'                             ao       60
'Ben Katz'                             ap       62
'Ben Katz'                             aq       63
'Ben Katz'                             ar       65
'Ben Katz'                             as       75
'Ben Katz'                             at       50
'Ben Katz'                             au       50
'Ben Katz'                             av       65

For each of the following, answer the question, and show the output from the analyses you used to answer the question.


What was the mean typing speed for each instructor?

Does Instructor have a significant effect on typing speed?

Are the residuals reasonably normal and homoscedastic?

Which instructors had classes with significantly different mean typing speed from which others?

Does this result correspond to what you would have thought from looking at the plot of the means and standard errors or box plots?