Blocks are used in an analysis of variance or similar models in order to account for suspected variation from factors other than the treatments or main independent variables being investigated.
Traditionally, in agricultural experiments, plots would be arranged into blocks according to factors in the field that could not be controlled. For example, if one side of the field were close to the tree line, those plots might get less sun, or be more protected from wind. So the experimenter would consider those plots near the tree line as one block, and another set in a more open spot another block. In this way, all kinds of variation in the soils and locations could be accounted for somewhat. The experiment might be designed in a randomized complete block design in which each block had a plot with each treatment.
When using blocks, the experimenter isn’t concerned necessarily with the effect of the blocks or even the factors behind assigning those blocks. That is, the east end of the field might be by the tree line and may have soils with slightly poorer drainage. If the experimenter were testing the effects of different fertilizers, she isn’t going to try to report the effect of “next to the tree line” or “in poorly drained soil,” but she does want to account for these effects statistically.
Blocks in extension education for measurements on students might include school or class or grade. There might be differences according to these factors, but we might not care to investigate if one particular school has a different result than another particular one. Still, we want to take to these differences into account statistically.
Appropriate data
• One-way data, with blocks. That is, one measurement variable in two or more groups, where each group is also distributed among at least two blocks
• Dependent variable is interval/ratio, and is continuous
• Independent variable is a factor with two or more levels. That is, two or more groups
• A second independent variable is a blocking factor variable with two or more levels
• In the general linear model approach, residuals are normally distributed
• In the general linear model approach, groups have the same variance. That is, homoscedasticity
• Observations among groups are independent. That is, not paired or repeated measures data, except that observations are within blocks
• Moderate deviation from normally-distributed residuals is permissible
Hypotheses
• Null hypothesis: The means of the measurement variable for each group are equal
• Alternative hypothesis (two-sided): The means of the measurement variable for among groups are not equal
• An additional null hypothesis is tested for the effect of blocks: The means of the measurement variable for each block are equal
Interpretation
• Reporting significant results for the omnibus test as “Significant differences were found among means for groups.” is acceptable. Alternatively, “A significant effect for independent variable on dependent variable was found.”
• Reporting significant results for mean separation post-hoc tests as “Mean of variable Y for group A was different than that for group B.” is acceptable.
Other notes and alternative tests
• A permutation test is a nonparametric alternative. Only for unreplicated complete block designs, the traditional nonparametric analogues for this test are the Friedman and Quade tests.
Packages used in this chapter
The packages used in this chapter include:
• psych
• FSA
• ggplot2
• car
• multcompView
• lsmeans
• rcompanion
The following commands will install these packages if they are not already installed:
if(!require(psych)){install.packages("psych")}
if(!require(FSA)){install.packages("FSA")}
if(!require(ggplot2)){install.packages("ggplot2")}
if(!require(car)){install.packages("car")}
if(!require(multcompView)){install.packages("multcompView")}
if(!require(lsmeans)){install.packages("lsmeans")}
if(!require(rcompanion)){install.packages("rcompanion")}
One-way ANOVA with blocks example
This example will revisit the sodium intake data set with Brendon Small and the other instructors. This time, though, they have recorded the town each student is from, and they would like to use this as a blocking variable. They suspect that the town of residence may have some effect on sodium intake since each town has varying income, ethnic makeup, and other demographic factors.
The instructors are focused on the effect of their different nutrition education programs. They are not concerned about sodium intake in one specific town or another per se, but they want to take account this effect into account statistically.
Note that when the blocking variable is added to the model, the calculated p-value for the effect of Instructor is different than it was for the case of the one-way ANOVA without the blocking variable.
Input = ("
Instructor Town Sodium
'Brendon Small' Squiggleville 1200
'Brendon Small' Squiggleville 1400
'Brendon Small' Squiggleville 1350
'Brendon Small' Metalocalypse 950
'Brendon Small' Squiggleville 1400
'Brendon Small' Squiggleville 1150
'Brendon Small' Squiggleville 1300
'Brendon Small' Metalocalypse 1325
'Brendon Small' Metalocalypse 1425
'Brendon Small' Squiggleville 1500
'Brendon Small' Squiggleville 1250
'Brendon Small' Metalocalypse 1150
'Brendon Small' Metalocalypse 950
'Brendon Small' Squiggleville 1150
'Brendon Small' Metalocalypse 1600
'Brendon Small' Metalocalypse 1300
'Brendon Small' Metalocalypse 1050
'Brendon Small' Metalocalypse 1300
'Brendon Small' Squiggleville 1700
'Brendon Small' Squiggleville 1300
'Coach McGuirk' Squiggleville 1100
'Coach McGuirk' Squiggleville 1200
'Coach McGuirk' Squiggleville 1250
'Coach McGuirk' Metalocalypse 1050
'Coach McGuirk' Metalocalypse 1200
'Coach McGuirk' Metalocalypse 1250
'Coach McGuirk' Squiggleville 1350
'Coach McGuirk' Squiggleville 1350
'Coach McGuirk' Squiggleville 1325
'Coach McGuirk' Squiggleville 1525
'Coach McGuirk' Squiggleville 1225
'Coach McGuirk' Squiggleville 1125
'Coach McGuirk' Metalocalypse 1000
'Coach McGuirk' Metalocalypse 1125
'Coach McGuirk' Squiggleville 1400
'Coach McGuirk' Metalocalypse 1200
'Coach McGuirk' Squiggleville 1150
'Coach McGuirk' Squiggleville 1400
'Coach McGuirk' Squiggleville 1500
'Coach McGuirk' Squiggleville 1200
'Melissa Robins' Metalocalypse 900
'Melissa Robins' Metalocalypse 1100
'Melissa Robins' Metalocalypse 1150
'Melissa Robins' Metalocalypse 950
'Melissa Robins' Metalocalypse 1100
'Melissa Robins' Metalocalypse 1150
'Melissa Robins' Squiggleville 1250
'Melissa Robins' Squiggleville 1250
'Melissa Robins' Squiggleville 1225
'Melissa Robins' Squiggleville 1325
'Melissa Robins' Metalocalypse 1125
'Melissa Robins' Metalocalypse 1025
'Melissa Robins' Metalocalypse 950
'Melissa Robins' Metalocalypse 925
'Melissa Robins' Squiggleville 1200
'Melissa Robins' Metalocalypse 1100
'Melissa Robins' Metalocalypse 950
'Melissa Robins' Metalocalypse 1300
'Melissa Robins' Squiggleville 1400
'Melissa Robins' Metalocalypse 1100
")
Data = read.table(textConnection(Input),header=TRUE)
### Order factors by the order in data frame
### Otherwise, R will alphabetize them
Data$Instructor = factor(Data$Instructor,
levels=unique(Data$Instructor))
Data$Town = factor(Data$Town,
levels=unique(Data$Town))
### Check the data frame
library(psych)
headTail(Data)
str(Data)
summary(Data)
### Remove unnecessary objects
rm(Input)
Use Summarize to check if cell sizes are balanced
We can use the n or nvalid columns in the Summarize output to check if cell sizes are balanced. That is, if each treatment and block combination have the same number of observations.
library(FSA)
Summarize(Sodium ~ Instructor + Town,
data=Data,
digits=3)
Instructor Town n mean sd min Q1 median Q3 max
1 Brendon Small Squiggleville 11 1336.364 162.928 1150 1225 1300 1400 1700
2 Coach McGuirk Squiggleville 14 1292.857 134.960 1100 1200 1288 1388 1525
3 Melissa Robins Squiggleville 6 1275.000 74.162 1200 1231 1250 1306 1400
4 Brendon Small Metalocalypse 9 1227.778 220.597 950 1050 1300 1325 1600
5 Coach McGuirk Metalocalypse 6 1137.500 97.147 1000 1069 1162 1200 1250
6 Melissa Robins Metalocalypse 14 1058.929 112.919 900 950 1100 1119 1300
Define linear model
model = lm(Sodium ~ Instructor + Town,
data = Data)
summary(model) ### Will show overall p-value and
r-squared
Multiple R-squared: 0.3485, Adjusted R-squared: 0.3136
F-statistic: 9.987 on 3 and 56 DF, p-value: 2.265e-05
Conduct analysis of variance
library(car)
Anova(model,
type = "II") ### Type II sum of
squares
Anova Table (Type II tests)
Response: Sodium
Sum Sq Df F value Pr(>F)
Instructor 148944 2 3.6006 0.0338033 *
Town 329551 1 15.9332 0.0001928 ***
Residuals 1158261 56
Histogram and plot of residuals
x = residuals(model)
library(rcompanion)
plotNormalHistogram(x)
plot(fitted(model),
residuals(model))
plot(model) ### Additional model
checking plots
Post-hoc analysis: mean separation tests
For an explanation of using least square means for multiple comparisons, see the section “Post-hoc analysis: mean separation tests” in the One-way ANOVA chapter.
In this example, you’ll note that the LS means are different from the arithmetic means calculated for Instructor in the last chapter. This is a result of the fact that instructors have different numbers of students from each town. That is, the design is unbalanced.
library(multcompView)
library(lsmeans)
marginal = lsmeans(model,
~ Instructor)
pairs(marginal,
adjust="tukey")
contrast estimate SE df t.ratio p.value
Brendon Small - Coach McGuirk 64.81742 45.86048 56 1.413 0.3410
Brendon Small - Melissa Robins 124.47097 46.53123 56 2.675 0.0261
Coach McGuirk - Melissa Robins 59.65356 48.12705 56 1.240 0.4352
Results are averaged over the levels of: Town
P value adjustment: tukey method for comparing a family of 3 estimates
CLD = cld(marginal,
alpha = 0.05,
Letters = letters, ### Use lower-case letters
for .group
adjust = "tukey") ### Tukey-adjusted
p-values
CLD
Instructor lsmean SE df lower.CL upper.CL .group
Melissa Robins 1155.173 33.10792 56 1088.850 1221.496 a
Coach McGuirk 1214.827 33.10792 56 1148.504 1281.150 ab
Brendon Small 1279.644 32.21856 56 1215.103 1344.186 b
Results are averaged over the levels of: Town
Confidence level used: 0.95
P value adjustment: tukey method for comparing a family of 3 estimates
significance level used: alpha = 0.05
Plot of means, confidence intervals, and mean-separation letters
This plot uses the least square means, confidence intervals, and mean-separation letters from the compact letter display above.
For different data, the nudge_y parameter will need to be adjusted manually.
### Order the levels for printing
CLD$Instructor = factor(CLD$Instructor,
levels=c("Brendon Small",
"Coach McGuirk",
"Melissa Robins"))
### Remove spaces in .group
CLD$.group=gsub(" ", "", CLD$.group)
### Plot
library(ggplot2)
ggplot(CLD,
aes(x = Instructor,
y = lsmean,
label = .group)) +
geom_point(shape = 15,
size = 4) +
geom_errorbar(aes(ymin = lower.CL,
ymax = upper.CL),
width = 0.2,
size = 0.7) +
theme_bw() +
theme(axis.title = element_text(face = "bold"),
axis.text = element_text(face = "bold"),
plot.caption = element_text(hjust = 0)) +
ylab("Least square mean\nsodium intake (mg / day)") +
geom_text(nudge_x = c(0, 0, 0),
nudge_y = c(120, 120, 120),
color = "black")
Daily intake of sodium from diaries for students in Supplemental Nutrition
Assistance Program Education (SNAP-ed) workshops. Boxes represent least square
mean values for each of three classes, following one-way analysis of variance
(ANOVA). Error bars indicate 95% confidence intervals of the LS means. Means
sharing a letter are not significantly different (alpha = 0.05, Tukey-adjusted).
Exercises S
1. Considering Brendon, Melissa, and McGuirk’s data with
means adjusted for students’ towns,
What was the LS mean sodium intake for students for each
instructor?
Does Instructor have a significant effect on sodium
intake?
Does Town have a significant effect on sodium intake?
Are the residuals reasonably normal and homoscedastic?
Which instructors had classes with significantly different mean
sodium intake from which others?
Does this result correspond to what you would have thought from
looking at the plot of the LS means and confidence intervals?
Is this design balanced or unbalanced?
2. As part of a professional skills program, a 4-H club tests its members for
typing proficiency. Dr. Katz, and Laura, and Ben want to compare their
students’ mean typing speed between their classes. They have also recorded the
year or grade of each student. They are not interested in the effect of Year
per se, but they want to account for the effect statistically.
Instructor Year Words.per.minute
'Dr. Katz Professional Therapist' 7 35
'Dr. Katz Professional Therapist' 7 50
'Dr. Katz Professional Therapist' 7 55
'Dr. Katz Professional Therapist' 7 60
'Dr. Katz Professional Therapist' 8 65
'Dr. Katz Professional Therapist' 8 60
'Dr. Katz Professional Therapist' 8 70
'Dr. Katz Professional Therapist' 8 55
'Dr. Katz Professional Therapist' 9 45
'Dr. Katz Professional Therapist' 9 55
'Dr. Katz Professional Therapist' 9 60
'Dr. Katz Professional Therapist' 9 45
'Dr. Katz Professional Therapist' 10 65
'Dr. Katz Professional Therapist' 10 55
'Dr. Katz Professional Therapist' 11 50
'Dr. Katz Professional Therapist' 12 60
'Laura the Receptionist' 7 55
'Laura the Receptionist' 7 60
'Laura the Receptionist' 7 75
'Laura the Receptionist' 7 65
'Laura the Receptionist' 8 60
'Laura the Receptionist' 8 70
'Laura the Receptionist' 8 75
'Laura the Receptionist' 8 70
'Laura the Receptionist' 9 65
'Laura the Receptionist' 9 72
'Laura the Receptionist' 9 73
'Laura the Receptionist' 9 65
'Laura the Receptionist' 10 80
'Laura the Receptionist' 10 50
'Laura the Receptionist' 10 60
'Laura the Receptionist' 10 70
'Ben Katz' 7 55
'Ben Katz' 7 55
'Ben Katz' 7 70
'Ben Katz' 7 55
'Ben Katz' 8 65
'Ben Katz' 8 60
'Ben Katz' 8 70
'Ben Katz' 8 60
'Ben Katz' 9 60
'Ben Katz' 9 62
'Ben Katz' 9 63
'Ben Katz' 9 65
'Ben Katz' 10 75
'Ben Katz' 10 50
'Ben Katz' 10 50
'Ben Katz' 10 65
Note that the following code is necessary for R to recognize Year as a
factor variable.
### Convert Year from an integer variable to a
factor variable
Data$Year = factor(Data$Year,
levels=c("7", "8", "9",
"10"))
For each of the following, answer the question, and show the output from the analyses you used to answer the question.
What was the LS mean typing speed for each instructor’s class?
Does Instructor have a significant effect on typing
speed?
Does Year have a significant effect on typing speed?
Are the residuals reasonably normal and homoscedastic?
Which instructors had classes with significantly different mean
typing speed from which others?
Does this result correspond to what you would have thought from
looking at the plot of the LS means and standard errors?
Is this design balanced or unbalanced?