Blocks are used in an analysis of variance or similar models in order to account for suspected variation from factors other than the treatments or main independent variables being investigated.

Traditionally, in agricultural experiments, plots would be
arranged into blocks according to factors in the field that could not be
controlled. For example, if one side of the field were close to the tree line,
those plots might get less sun, or be more protected from wind. So the
experimenter would consider those plots near the tree line as one block, and
another set in a more open spot another block. In this way, all kinds of
variation in the soils and locations could be accounted for somewhat. The
experiment might be designed in a *randomized complete block design* in
which each block had a plot with each treatment.

When using blocks, the experimenter isn’t concerned necessarily with the effect of the blocks or even the factors behind assigning those blocks. That is, the east end of the field might be by the tree line and may have soils with slightly poorer drainage. If the experimenter were testing the effects of different fertilizers, she isn’t going to try to report the effect of “next to the tree line” or “in poorly drained soil,” but she does want to account for these effects statistically.

Blocks in extension education for measurements on students might include school or class or grade. There might be differences according to these factors, but we might not care to investigate if one particular school has a different result than another particular one. Still, we want to take to these differences into account statistically.

##### Appropriate data

• One-way data, with blocks. That is, one measurement variable in two or more groups, where each group is also distributed among at least two blocks

• Dependent variable is interval/ratio, and is continuous

• Independent variable is a factor with two or more levels. That is, two or more groups

• A second independent variable is a blocking factor variable with two or more levels

• In the general linear model approach, residuals are normally distributed

• In the general linear model approach, groups have the same variance. That is, homoscedasticity

• Observations among groups are independent. That is, not paired or repeated measures data, except that observations are within blocks

• Moderate deviation from normally-distributed residuals is permissible

##### Hypotheses

• Null hypothesis: The means of the measurement variable for each group are equal

• Alternative hypothesis (two-sided): The means of the measurement variable for among groups are not equal

• An additional null hypothesis is tested for the effect of blocks: The means of the measurement variable for each block are equal

##### Interpretation

• Reporting significant results for the omnibus test as “Significant differences were found among means for groups.” is acceptable. Alternatively, “A significant effect for independent variable on dependent variable was found.”

• Reporting significant results for mean separation post-hoc tests as “Mean of variable Y for group A was different than that for group B.” is acceptable.

##### Other notes and alternative tests

• Only for *unreplicated complete block designs*, the nonparametric
analogues for this test are the Friedman and Quade tests.

### Packages used in this chapter

The packages used in this chapter include:

• psych

• FSA

• ggplot2

• car

• multcompView

• lsmeans

• rcompanion

The following commands will install these packages if they are not already installed:

if(!require(psych)){install.packages("psych")}

if(!require(FSA)){install.packages("FSA")}

if(!require(ggplot2)){install.packages("ggplot2")}

if(!require(car)){install.packages("car")}

if(!require(multcompView)){install.packages("multcompView")}

if(!require(lsmeans)){install.packages("lsmeans")}

if(!require(rcompanion)){install.packages("rcompanion")}

### One-way ANOVA with blocks example

This example will revisit the sodium intake data set with Brendon Small and the other instructors. This time, though, they have recorded the town each student is from, and they would like to use this as a blocking variable. They suspect that the town of residence may have some effect on sodium intake since each town has varying income, ethnic makeup, and other demographic factors.

The instructors are focused on the effect of their different
nutrition education programs. They are not concerned about sodium intake in
one specific town or another *per se*, but they want to take account this
effect into account statistically.

Note that when the blocking variable is added to the model,
the calculated *p*-value for the effect of *Instructor* is different
than it was for the case of the one-way ANOVA without the blocking variable.

Input = ("

Instructor Town Sodium

'Brendon Small' Squiggleville 1200

'Brendon Small' Squiggleville 1400

'Brendon Small' Squiggleville 1350

'Brendon Small' Metalocalypse 950

'Brendon Small' Squiggleville 1400

'Brendon Small' Squiggleville 1150

'Brendon Small' Squiggleville 1300

'Brendon Small' Metalocalypse 1325

'Brendon Small' Metalocalypse 1425

'Brendon Small' Squiggleville 1500

'Brendon Small' Squiggleville 1250

'Brendon Small' Metalocalypse 1150

'Brendon Small' Metalocalypse 950

'Brendon Small' Squiggleville 1150

'Brendon Small' Metalocalypse 1600

'Brendon Small' Metalocalypse 1300

'Brendon Small' Metalocalypse 1050

'Brendon Small' Metalocalypse 1300

'Brendon Small' Squiggleville 1700

'Brendon Small' Squiggleville 1300

'Coach McGuirk' Squiggleville 1100

'Coach McGuirk' Squiggleville 1200

'Coach McGuirk' Squiggleville 1250

'Coach McGuirk' Metalocalypse 1050

'Coach McGuirk' Metalocalypse 1200

'Coach McGuirk' Metalocalypse 1250

'Coach McGuirk' Squiggleville 1350

'Coach McGuirk' Squiggleville 1350

'Coach McGuirk' Squiggleville 1325

'Coach McGuirk' Squiggleville 1525

'Coach McGuirk' Squiggleville 1225

'Coach McGuirk' Squiggleville 1125

'Coach McGuirk' Metalocalypse 1000

'Coach McGuirk' Metalocalypse 1125

'Coach McGuirk' Squiggleville 1400

'Coach McGuirk' Metalocalypse 1200

'Coach McGuirk' Squiggleville 1150

'Coach McGuirk' Squiggleville 1400

'Coach McGuirk' Squiggleville 1500

'Coach McGuirk' Squiggleville 1200

'Melissa Robins' Metalocalypse 900

'Melissa Robins' Metalocalypse 1100

'Melissa Robins' Metalocalypse 1150

'Melissa Robins' Metalocalypse 950

'Melissa Robins' Metalocalypse 1100

'Melissa Robins' Metalocalypse 1150

'Melissa Robins' Squiggleville 1250

'Melissa Robins' Squiggleville 1250

'Melissa Robins' Squiggleville 1225

'Melissa Robins' Squiggleville 1325

'Melissa Robins' Metalocalypse 1125

'Melissa Robins' Metalocalypse 1025

'Melissa Robins' Metalocalypse 950

'Melissa Robins' Metalocalypse 925

'Melissa Robins' Squiggleville 1200

'Melissa Robins' Metalocalypse 1100

'Melissa Robins' Metalocalypse 950

'Melissa Robins' Metalocalypse 1300

'Melissa Robins' Squiggleville 1400

'Melissa Robins' Metalocalypse 1100

")

Data = read.table(textConnection(Input),header=TRUE)

### Order factors by the order in data frame

### Otherwise, R will alphabetize them

Data$Instructor = factor(Data$Instructor,

levels=unique(Data$Instructor))

Data$Town = factor(Data$Town,

levels=unique(Data$Town))

### Check the data frame

library(psych)

headTail(Data)

str(Data)

summary(Data)

### Remove unnecessary objects

rm(Input)

#### Use Summarize to check if cell sizes are balanced

We can use the *n* or *nvalid* columns in the *Summarize*
output to check if cell sizes are balanced. That is, if each treatment and
block combination have the same number of observations.

library(FSA)

Summarize(Sodium ~ Instructor + Town,

data=Data,

digits=3)

#### Define linear model

model = lm(Sodium ~ Instructor + Town,

data = Data)

summary(model) ### Will show overall p-value and
r-squared

Multiple R-squared: 0.3485, Adjusted R-squared: 0.3136

F-statistic: 9.987 on 3 and 56 DF, p-value: 2.265e-05

#### Conduct analysis of variance

library(car)

Anova(model,

type = "II") ### Type II sum of
squares

Anova Table (Type II tests)

Response: Sodium

Sum Sq Df F value Pr(>F)

Instructor 148944 2 3.6006 0.0338033 *

Town 329551 1 15.9332 0.0001928 ***

Residuals 1158261 56

#### Histogram and plot of residuals

x = residuals(model)

library(rcompanion)

plotNormalHistogram(x)

plot(fitted(model),

residuals(model))

plot(model) ### Additional model
checking plots

#### Post-hoc analysis: mean separation tests

For an explanation of using least square means for multiple
comparisons, see the section “Post-hoc analysis: mean separation tests” in the
*One-way ANOVA* chapter.

In this example, you’ll note that the LS means are different
from the arithmetic means calculated for *Instructor* in the last
chapter. This is a result of the fact that instructors have different numbers
of students from each town. That is, the design is *unbalanced*.

library(multcompView)

library(lsmeans)

leastsquare = lsmeans(model,

pairwise ~ Instructor,

adjust = "tukey")

leastsquare

$lsmeans

Instructor lsmean SE df lower.CL upper.CL

Brendon Small 1279.644 32.21856 56 1215.103 1344.186

Coach McGuirk 1214.827 33.10792 56 1148.504 1281.150

Melissa Robins 1155.173 33.10792 56 1088.850 1221.496

Results are averaged over the levels of: Town

Confidence level used: 0.95

$contrasts

contrast estimate SE df t.ratio p.value

Brendon Small - Coach McGuirk 64.81742 45.86048 56 1.413 0.3410

Brendon Small - Melissa Robins 124.47097 46.53123 56 2.675 0.0261

Coach McGuirk - Melissa Robins 59.65356 48.12705 56 1.240 0.4352

Results are averaged over the levels of: Town

P value adjustment: tukey method for comparing a family of 3 estimates

cld(leastsquare,

alpha = 0.05,

Letters = letters, ### Use lower-case letters
for .group

adjust = "tukey") ### Tukey-adjusted
p-values

Instructor lsmean SE df lower.CL upper.CL .group

Melissa Robins 1155.173 33.10792 56 1088.850 1221.496 a

Coach McGuirk 1214.827 33.10792 56 1148.504 1281.150 ab

Brendon Small 1279.644 32.21856 56 1215.103 1344.186 b

Results are averaged over the levels of: Town

Confidence level used: 0.95

P value adjustment: tukey method for comparing a family of 3 estimates

significance level used: alpha = 0.05

#### Plot of LS means and confidence intervals

Extracting the LS means from the output of the *lsmeans*
function takes a few extra steps. We’ll create a data frame called *Sum*,
and pass that data frame to *ggplot* to plot the LS means and confidence
intervals.

When a design is unbalanced, it is usually desirable to
present LS means. For further discussion on this, see the chapter *What are
Least Square Means?*

### Extract variables from leastsquare and create new data frame called Sum

Instructor = summary(leastsquare$lsmeans)["Instructor"]

lsmean = summary(leastsquare$lsmeans)["lsmean"]

lower.CL = summary(leastsquare$lsmeans)["lower.CL"]

upper.CL = summary(leastsquare$lsmeans)["upper.CL"]

Sum = data.frame(Instructor, lsmean, lower.CL, upper.CL)

Sum

library(ggplot2)

ggplot(Sum, ### The data frame to
use.

aes(x = Instructor,

y = lsmean)) +

geom_errorbar(aes(ymin = lower.CL,

ymax = upper.CL),

width = 0.05,

size = 0.5) +

geom_point(shape = 15,

size = 4) +

theme_bw() +

theme(axis.title = element_text(face = "bold")) +

ylab("LS mean sodium, mg")

Least square mean daily sodium intake for students in
each of three classes following different nutrition education programs. Error
bars represent confidence intervals of the least square mean.

### Exercises S

1. Considering Brendon, Melissa, and McGuirk’s data with
means adjusted for students’ towns,

What was the LS mean sodium intake for students for each
instructor?

Does *Instructor* have a significant effect on sodium
intake?

Does *Town* have a significant effect on sodium intake?

Are the residuals reasonably normal and homoscedastic?

Which instructors had classes with significantly different mean
sodium intake from which others?

Does this result correspond to what you would have thought from
looking at the plot of the LS means and confidence intervals?

Is this design balanced or unbalanced?

2. As part of a professional skills program, a 4-H club tests its members for
typing proficiency. Dr. Katz, and Laura, and Ben want to compare their
students’ mean typing speed between their classes. They have also recorded the
year or grade of each student. They are not interested in the effect of *Year*
*per se*, but they want to account for the effect statistically.

Instructor Year Words.per.minute

'Dr. Katz Professional Therapist' 7 35

'Dr. Katz Professional Therapist' 7 50

'Dr. Katz Professional Therapist' 7 55

'Dr. Katz Professional Therapist' 7 60

'Dr. Katz Professional Therapist' 8 65

'Dr. Katz Professional Therapist' 8 60

'Dr. Katz Professional Therapist' 8 70

'Dr. Katz Professional Therapist' 8 55

'Dr. Katz Professional Therapist' 9 45

'Dr. Katz Professional Therapist' 9 55

'Dr. Katz Professional Therapist' 9 60

'Dr. Katz Professional Therapist' 9 45

'Dr. Katz Professional Therapist' 10 65

'Dr. Katz Professional Therapist' 10 55

'Dr. Katz Professional Therapist' 11 50

'Dr. Katz Professional Therapist' 12 60

'Laura the Receptionist' 7 55

'Laura the Receptionist' 7 60

'Laura the Receptionist' 7 75

'Laura the Receptionist' 7 65

'Laura the Receptionist' 8 60

'Laura the Receptionist' 8 70

'Laura the Receptionist' 8 75

'Laura the Receptionist' 8 70

'Laura the Receptionist' 9 65

'Laura the Receptionist' 9 72

'Laura the Receptionist' 9 73

'Laura the Receptionist' 9 65

'Laura the Receptionist' 10 80

'Laura the Receptionist' 10 50

'Laura the Receptionist' 10 60

'Laura the Receptionist' 10 70

'Ben Katz' 7 55

'Ben Katz' 7 55

'Ben Katz' 7 70

'Ben Katz' 7 55

'Ben Katz' 8 65

'Ben Katz' 8 60

'Ben Katz' 8 70

'Ben Katz' 8 60

'Ben Katz' 9 60

'Ben Katz' 9 62

'Ben Katz' 9 63

'Ben Katz' 9 65

'Ben Katz' 10 75

'Ben Katz' 10 50

'Ben Katz' 10 50

'Ben Katz' 10 65

Note that the following code is necessary for R to recognize *Year* as a
factor variable.

### Convert Year from an integer variable to a
factor variable

Data$Year = factor(Data$Year,

levels=c("7", "8", "9",
"10"))

For each of the following, answer the question, and ** show
the output from the analyses you used to answer the question**.

What was the LS mean typing speed for each instructor’s class?

Does *Instructor* have a significant effect on typing
speed?

Does *Year* have a significant effect on typing speed?

Are the residuals reasonably normal and homoscedastic?

Which instructors had classes with significantly different mean
typing speed from which others?

Does this result correspond to what you would have thought from
looking at the plot of the LS means and standard errors?

Is this design balanced or unbalanced?