
Summary and Analysis of Extension Program Evaluation in R

Salvatore S. Mangiafico

Using R

R and RStudio

 

This book will use the software package R Project for Statistical Computing to create plots and conduct statistical analyses.  It is free to install on a Windows, Mac, or Linux computer.  Although it is not required, I also recommend using RStudio, which is free as well.

 

Links to websites where the software can be obtained are included in the “Obtaining R” section in the “Required Readings” below.

 

Using the RStudio environment

 

RStudio provides a nice work environment because it presents several windows on the screen that make it easy to view code, results, and plots at once. 

 

Program code can be worked on in the upper left Script window.  If that window isn’t displayed, it can be opened with  File  >  New File  >  R Script.  To run code, select it in the Script window and click the Run button.  The results are reported in the Console window on the lower left.  Code can be saved as an .r or .R file, and those files should subsequently open automatically with RStudio.

 

As an alternative, code can be pasted directly in the Console window. 

 

The lower right window will usually show either plot results or help results.

 

Installation

As far as I know, R should be installed first.  Then, when RStudio is installed, you will tell RStudio where R is installed on the machine.  Links to obtain this software are in the “Obtaining R” section in the “Required Readings” below.
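
If you want to confirm which version of R is installed and where it is located on the machine, the following base R calls can be run in the Console.  This is only a small sketch and not part of the installation itself; the output will differ from one computer to another.

```r
# Report the version of R that is currently running
R.version.string

# Report the directory where R is installed
R.home()
```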

 

Using the R Console environment

 

If you are not using RStudio, code is pasted directly into the R GUI Console.  Results are produced in the Console as code is entered, and plots will open in a separate window.

 

What if I don’t have my own computer?

 

If you don’t have your own computer on which to install the software, there are a few options.

 

Portable installation on a USB drive

One solution is to install R Portable on a portable USB drive.  You can then run this software on university or other computers directly from that drive, if the computer is set up to give you permission to do so.  You should check to see if you will have permission to run portable software from a USB drive on these computers.

 

Using university computers

R and RStudio are installed on Rutgers University computers in computer laboratories.  You should check with the individual computer lab, and make sure you will have permission to install additional R packages on those machines.

 

Using R online

There are websites on which you can run R in an online environment.  r-fiddle.org is one such site.  However, I have had trouble using some R packages used in this book on r-fiddle.

 

apps.rutgers.edu

R and RStudio are also included in the apps.rutgers.edu environment.  You can log in at apps.rutgers.edu with your Rutgers Net ID.  Then,  Desktop  >  Menu  >  Development  >  R Studio.  In general, I have not found this to be the most convenient environment to work in.  Working on a laptop, I have found that I had to zoom out with the browser zoom to see the whole virtual desktop on my screen.  To transfer text to the desktop environment, paste the text into the clipboard using the clipboard icon at the upper right of the screen.  Also, I have had trouble trying to install additional R packages in this environment.

 

Tests for package installation

 

If you are unsure if you can install additional R packages in the environment you are working in, try the two examples below.  The psych and FSA packages may take a while for their initial installation.  The code for you to run is in blue, and the output is in red.  The output is truncated here.

 

Don’t worry too much about what the code is doing at this point.  The main point is to see if you can get output from both the psych and FSA packages.

 

The code assigns a vector of numbers to Score, and a vector of text strings to Student.  It then combines those two into a data frame called Data, which is then printed.  The summary function counts the values in Student, and determines the median and other statistics for Score.  The psych package is installed, then loaded with the library function, and then is used to output summary statistics for Score for each Student.  The same is then done with the FSA package.

 

Remember to run only the blue code.  The red code is the (truncated) output R should produce.


Score = c(10, 9, 8, 7, 7, 8, 9, 10, 6, 5, 4, 9, 10, 9, 10)
Student = c("Bugs", "Bugs", "Bugs", "Bugs","Bugs",
            "Daffy", "Daffy", "Daffy", "Daffy", "Daffy",
            "Taz", "Taz", "Taz", "Taz", "Taz")

Data = data.frame(Student, Score, stringsAsFactors = TRUE)

Data


   Student Score
1     Bugs    10
2     Bugs     9
3     Bugs     8
4     Bugs     7
5     Bugs     7
6    Daffy     8
7    Daffy     9
8    Daffy    10
9    Daffy     6
10   Daffy     5
11     Taz     4
12     Taz     9
13     Taz    10
14     Taz     9
15     Taz    10


summary(Data)


  Student      Score      
 Bugs :5   Min.   : 4.000 
 Daffy:5   1st Qu.: 7.000 
 Taz  :5   Median : 9.000 
           Mean   : 8.067 
           3rd Qu.: 9.500 
           Max.   :10.000


if(!require(psych)){install.packages("psych")}

library(psych)

describeBy(x = Score,
           group = Student)


group: Bugs
  vars n mean  sd median trimmed  mad min max range skew kurtosis   se
1    1 5  8.2 1.3      8     8.2 1.48   7  10     3 0.26    -1.96 0.58
------------------------------------------------------------------------
group: Daffy
  vars n mean   sd median trimmed  mad min max range  skew kurtosis   se
1    1 5  7.6 2.07      8     7.6 2.97   5  10     5 -0.11    -2.03 0.93
------------------------------------------------------------------------
group: Taz
  vars n mean   sd median trimmed  mad min max range  skew kurtosis   se
1    1 5  8.4 2.51      9     8.4 1.48   4  10     6 -0.97    -1.04 1.12


if(!require(FSA)){install.packages("FSA")}

library(FSA)

Summarize(Score ~ Student,
          data=Data)


  Student n mean       sd min Q1 median Q3 max percZero
1    Bugs 5  8.2 1.303840   7  7      8  9  10        0
2   Daffy 5  7.6 2.073644   5  6      8  9  10        0
3     Taz 5  8.4 2.509980   4  9      9 10  10        0
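
If you are unsure afterward whether both packages are available, one way to check is sketched below.  It uses the base R function requireNamespace, which returns TRUE when a package can be found and loaded.  The object names Packages and Installed are chosen here only for illustration.

```r
# Names of the packages used in the tests above
Packages = c("psych", "FSA")

# TRUE for each package that R can find and load, FALSE otherwise
Installed = sapply(Packages, requireNamespace, quietly = TRUE)

Installed
```

A FALSE for either package suggests that the installation did not succeed in the environment you are working in.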


Required readings

 

The following readings are required for this chapter.  You can read them at the individual links below, or as chapters in the pdf version of the R Companion to the Handbook of Biological Statistics (rcompanion.org/documents/RCompanionBioStatistics.pdf).

 

About R

rcompanion.org/rcompanion/a_04.html

 

Obtaining R

rcompanion.org/rcompanion/a_05.html

 

A Few Notes to Get Started with R

rcompanion.org/rcompanion/a_06.html

 

Avoiding Pitfalls in R

rcompanion.org/rcompanion/a_07.html

 

Help with R

rcompanion.org/rcompanion/a_08.html

 

R Tutorials

rcompanion.org/rcompanion/a_09.html

 

References for this chapter

 

Mangiafico, S.S. 2015. An R Companion for the Handbook of Biological Statistics, version 1.09.

rcompanion.org/rcompanion/. (Pdf version: rcompanion.org/documents/RCompanionBioStatistics.pdf.)



Exercises A

 

1.  Install R and RStudio on your computer, or determine how you will access this software.  Be sure that you will be able to install additional R packages (as in “Tests for package installation” above).

 

If you haven't installed the FSA and psych packages, do so with the following commands, or use Tools  >  Install packages  in RStudio.

 

if(!require(FSA)){install.packages("FSA")}
if(!require(psych)){install.packages("psych")}

Use the following code to import from the internet a data frame of river stage measurements for Greenwich, NJ.


Greenwich = read.table("http://rcompanion.org/documents/Greenwich.csv",
                        header=TRUE, sep=",",
                        stringsAsFactors=TRUE)

 

a. Summarize the stage measurements by year and report the results.

 



library(FSA)

Summarize(Stage ~ Year,
          data = Greenwich)

 

Examine the variables in the data frame.  (Don’t report anything.)


str(Greenwich)


library(psych)

headTail(Greenwich)

summary(Greenwich)

The str function reports the type of each variable in the data frame.  Agency is treated as a factor variable by R.  We’ll consider this a nominal variable.  It has one level, which is “USGS”.   The rest of the variables are being treated as integer or numeric variables.

 

The function headTail in the psych package shows you the top and bottom of the data frame, arranged in the usual way with variables in columns and observations in rows.

 

The summary function reports some summary statistics for the data frame.  In this case, it reports that there are 171 instances of “USGS” for the variable Agency.  Since it treats Year as an integer variable, it tells you that the minimum is 2000 and the maximum is 2014.  It tells us that there are 6 NA values in the variable Stage.

 

These functions are useful to see what variables are in a data frame, and to be sure that the data frame reflects the data we think it should.  For example, this data frame has 171 observations.  If we were expecting 1000 observations, we would know something went wrong in our data entry or importing of the data.
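
The same kind of check can be made directly with a few base R functions.  The sketch below uses a small, made-up data frame called Toy (hypothetical values, not the Greenwich data) so that it can be run without an internet connection.

```r
# Hypothetical toy data, not the Greenwich data
Toy = data.frame(Year  = c(2000, 2000, 2001, 2001),
                 Stage = c(1.2, NA, 1.5, 1.1))

nrow(Toy)                 # number of observations (rows)
ncol(Toy)                 # number of variables (columns)
sum(is.na(Toy$Stage))     # number of NA values in Stage
```

If the number of rows or the number of NA values is not what you expected, that is a sign to recheck the data entry or import step.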

 

We could create a new variable called Year.f that would be the same as Year, but treated as a factor variable.


Greenwich$Year.f = as.factor(Greenwich$Year)

 

Now when we use the summary function, it will report the counts for some levels of Year.f.


summary(Greenwich)
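
To see why the variable type matters, here is a minimal sketch with a short, made-up vector of years (hypothetical values).  For a numeric variable, summary reports the minimum, quartiles, mean, and maximum; for a factor variable, it reports the count for each level.

```r
# Hypothetical years, for illustration only
Year = c(2000, 2000, 2001, 2002, 2002, 2002)

summary(Year)             # numeric summary: min, quartiles, mean, max

Year.f = as.factor(Year)

summary(Year.f)           # counts for each level: 2000, 2001, 2002
```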

 

b. In future chapters, we will learn to summarize data in more specific ways.  In this case, if we want the counts for all the years in Year.f, we can do the following.   Report this result.


xtabs(~ Year.f,
      data = Greenwich)

 

2. The following will create a data frame called TwoCats with Pepé’s and Penelope’s scores from a talent show, and produce summary statistics for Penelope’s scores.


Input = ("
Cat            Score
'Pepé Le Pew'   8
'Pepé Le Pew'   8
'Pepé Le Pew'   8
'Pepé Le Pew'   8
'Pepé Le Pew'   8
'Pepé Le Pew'   8
Penelope       10
Penelope        9
Penelope        9
Penelope        8
Penelope        7
Penelope        7
")

TwoCats = read.table(textConnection(Input),header=TRUE)

library(FSA)

Summarize(Score ~ Cat,
          data = TwoCats,
          exclude = "Pepé Le Pew")


a.  Report the results for Penelope.

 

Change Pepé’s scores to 9, 9, 7, 7, 6, 5.  Change the exclude option in the Summarize function to exclude only Penelope.  Penelope’s name will need to be in simple double quotes. 

 

b.  Report the results for the summary statistics for Pepé’s new scores.  (The new mean for Pepé should be 7.1667.)

 

c.  Summarize the data below with the Summarize function in the FSA package.  Report the results. 

 

To do this,

•  Copy the data below and paste it into the code above.  Make sure you retain the Input line and the ") line.


Dog      Snacks
Scooby    4
Scooby    3
Scooby    6
Scooby   18
Scooby    7
Scrappy   8
Scrappy  10
Scrappy   6
Scrappy   5
Scrappy  15
Scrappy   7
Scrappy   9


•  Get rid of the exclude option in the Summarize call.  Pay attention to the placement of the commas and parentheses, so it will look like this:

 

Summarize(Score ~ Cat,
          data = TwoCats)

•  In the code, change TwoCats to TwoDogs.

•  In the Summarize function call, change TwoCats to TwoDogs, change Score to Snacks, change Cat to Dog.

 

You should get summary statistics for both Scooby and Scrappy in one output.  The number of observations (n) for Scooby should be 5 and for Scrappy should be 7.  The mean snacks for Scooby should be 7.6 and the mean snacks for Scrappy should be 8.57.



Exercises Alpha

 

1.  The package FSA contains a data set called ChinookArg that has the lengths and weights of Chinook salmon at three locations.

 

library(FSA)

data(ChinookArg)

 

We can find some information about the data set with

 

?ChinookArg

 

a.  Report: From what country are these data?

 

We can see the first and last rows of the data with

 

library(psych)

headTail(ChinookArg)

 

Note that the variable tl is length, w is weight, and loc is location.  This is explained in the help file pulled up with

 

?ChinookArg

 

We can then get some summary statistics about the data.

 

library(FSA)

Summarize(tl ~ loc,
          data = ChinookArg)

 

This will provide summary statistics about length at each location.

 

In the output, n is the number of observations for that group.  The other summary statistics are mean, standard deviation, minimum, 1st quartile, median, 3rd quartile, and maximum.  These statistics will be discussed in a later chapter.

 

This example uses formula notation, where tl is the measurement variable and loc is the grouping variable, and they are separated with a tilde.  We could also think of tl as the dependent variable and loc as the independent variable. 
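
Formula notation is not specific to the FSA package.  As a sketch of the same idea in base R, the aggregate function also accepts a formula.  The example below reuses the small Score and Student data from “Tests for package installation”; the object name Means is chosen here only for illustration.

```r
# Same small data set used in "Tests for package installation"
Score = c(10, 9, 8, 7, 7, 8, 9, 10, 6, 5, 4, 9, 10, 9, 10)
Student = c(rep("Bugs", 5), rep("Daffy", 5), rep("Taz", 5))

Data = data.frame(Student, Score)

# Score is the measurement variable, Student is the grouping variable
Means = aggregate(Score ~ Student,
                  data = Data,
                  FUN  = mean)

Means
```

The measurement variable goes to the left of the tilde and the grouping variable to the right, just as in the Summarize call.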

 

Some functions accept formula notation and some do not.  Usually asking for help about the function will help you determine what input is required and what other arguments can be passed to the function, e.g.

 

?Summarize

 

Answer the following:

 

b.  What is the number of observations for Petrohue?

 

c.  What is the mean length for Chinook in Petrohue?

 

d.  The minimum length in Petrohue?

 

e.  The maximum length in Petrohue?

 

2.  Install the ggplot2 package with the following command, or use Tools  >  Install packages  in RStudio.

 

if(!require(ggplot2)){install.packages("ggplot2")}

The following code will plot Chinook length vs. weight.  It will add a smooth line to the plot.

 

library(ggplot2)

qplot(x    = w,
      y    = tl,
      data = ChinookArg,
      geom = c("point", "smooth"),
      xlab = "Weight (kg)",
      ylab = "Length (cm)",
      main = "Chinook plot by Sal")

 

a. In the code above, change "Sal" to your name.  Export the plot and embed it in your assignment.

 

In RStudio, in the Plot window, you can try the Export menu, and save as image, save as pdf, or copy to clipboard.

 

For assignments, probably the easiest thing is to use Export, save as image.  The recommended format is .png.  Pay attention to the directory where your image was saved.

 

For high resolution images, I tend to use the .pdf option and then edit the file with Photoshop or GIMP.

 

If you don't use RStudio, or would like to specify the size and resolution of image files, you can use

 

png(filename = "Rplot%03d.png",
    width  = 4,
    height = 3,
    units  = "in",
    res    = 600)

qplot(x    = w,
      y    = tl,
      data = ChinookArg,
      geom = c("point", "smooth"),
      xlab = "Weight (kg)",
      ylab = "Length (cm)",
      main = "Chinook plot by Sal")

 

dev.off()

 

You may use the following to see where the file was saved, if you didn’t specify a path in the filename argument.

 

getwd()
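
If you would rather choose the location yourself, a full path can be passed in the filename argument.  The sketch below is only an illustration: it uses a simple base R plot with made-up points, the object name OutFile is hypothetical, and it writes to R's temporary directory so that it is safe to run anywhere.  In practice you would substitute a folder of your own.

```r
# Build a full path; tempdir() is used only so the example is safe to run
OutFile = file.path(tempdir(), "example_plot.png")

png(filename = OutFile,
    width  = 4,
    height = 3,
    units  = "in",
    res    = 150)

# Made-up points, for illustration only
plot(x    = c(1, 2, 3),
     y    = c(2, 4, 6),
     xlab = "x",
     ylab = "y",
     main = "Example plot")

dev.off()                 # closes the device and writes the file

file.exists(OutFile)      # TRUE if the file was written
```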



3.  In the qplot function above, switch the variable for weight, w, with the variable for location, loc.  Remove the whole line with the geom argument.  And change the xlab argument to something appropriate.

 

a.  Export the plot and embed it in your assignment.

 

 

4.  There's another data set in the FSA package called WhitefishLC.

 

Using the command

 

?WhitefishLC

 

Answer the following lettered questions.

 

a.    What are the units for fish length in this data set?

 

 

Using this data set, summarize the length of whitefish (tl) by their age (scale1).

 

Use the code for Summarize that you used in 1.a.  Make sure you change the names of the variables in the function and the name of the data frame in the function.

 

b.  Report the results.

 

c.  Do you observe anything interesting in these results?

 

 

Plot length vs. age for whitefish.

 

Use the code you used in 2.  Be sure to change the name of the variables and of the data frame.

  

d.  Embed this plot with your assignment.  Include appropriate axis labels and be sure the units of length are correct.  If you include a title, be sure it is appropriate.

 

e. How would you describe the relationship of length and age for these fish?