
Summary and Analysis of Extension Program Evaluation in R

Salvatore S. Mangiafico

Using R

R and RStudio

 

This book will use the software package R Project for Statistical Computing to create plots and conduct statistical analyses.  It is free to install on a Windows, Mac, or Linux computer.  Although it is not required, I also recommend using RStudio, which is free as well.

 

Links to websites where the software can be obtained are included in the “Obtaining R” section in the “Required Readings” below.

 

Using the RStudio environment

 

RStudio provides a nice work environment because it presents several windows on the screen that make it easy to view code, results, and plots at once. 

 

Program code can be worked on in the upper left Script window.  If that window isn’t displayed, it can be opened with  File  >  New File  >  R Script.  To run code, select it in the Script window and click the Run button.  The results are reported in the Console window on the lower left.  Code can be saved as an .r or .R file, and those files should subsequently open automatically with RStudio.

 

As an alternative, code can be pasted directly in the Console window. 

 

The lower right window will usually show either plot results or help results.

 

Installation

As far as I know, R should be installed first.  Then, when RStudio is installed, you will tell RStudio where R is installed on the machine.  Links to obtain this software are in the “Obtaining R” section in the “Required Readings” below.

 

Using the R Console environment

 

If you are not using RStudio, code is pasted directly into the R GUI Console.  Results are produced in the Console as code is entered, and plots will open in a separate window.
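
Code saved in an .r or .R file can also be run from the Console without pasting it.  The following is a minimal sketch using the base R function source.  The file name example.R, its contents, and the use of a temporary folder are assumptions made only for illustration.

```r
# Hypothetical script file, written to a temporary folder for this sketch
ScriptFile = file.path(tempdir(), "example.R")

writeLines(c("Score = c(10, 9, 8, 7, 7)",
             "Mean.score = mean(Score)",
             "print(Mean.score)"),
           ScriptFile)

# Run every line of the saved file; echo = TRUE also shows each line as it runs
source(ScriptFile, echo = TRUE)
```

In practice, you would replace ScriptFile with the path to your own saved file.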

 

What if I don’t have my own computer?

 

If you don’t have your own computer on which to install the software, there are a few options.

 

Portable installation on a USB drive

One solution is to install R Portable on a portable USB drive.  You can then run this software directly from that drive on university or other computers, provided those computers give you permission to run portable software from a USB drive.  You should check for this permission before relying on this option.

 

Using university computers

R and RStudio are installed on Rutgers University computers in computer laboratories.  You should check with the individual computer lab, and make sure you will have permission to install additional R packages on those machines.

 

Using R online

There are websites on which you can run R in an online environment.  r-fiddle.org is one such site.  However, I have had trouble using some R packages used in this book on r-fiddle.

 

apps.rutgers.edu

R and RStudio are also included in the apps.rutgers.edu environment.  You can log in at apps.rutgers.edu with your Rutgers Net ID.  Then,  Desktop  >  Menu  >  Development  >  R Studio.  In general, I have not found this to be the most convenient environment to work in.  Working on a laptop, I have found that I had to zoom out with the browser zoom to see the whole virtual desktop on my screen.  To transfer text to the desktop environment, paste the text into the clipboard using the clipboard icon at the upper right of the screen.  Also, I have had trouble trying to install additional R packages in this environment.

 

Tests for package installation

 

If you are unsure if you can install additional R packages in the environment you are working in, try the two examples below.  The psych and FSA packages may take a while for their initial installation.  The code for you to run is in blue, and the output is in red.  The output is truncated here.

 

Don’t worry too much about what the code is doing at this point.  The main point is to see if you can get output from both the psych and FSA packages.

 

The code assigns a vector of numbers to Score, and a vector of text strings to Student.  It then combines those two into a data frame called Data, which is then printed.  The summary function counts the values in Student, and determines the median and other statistics for Score.  The psych package is installed, then loaded with the library function, and then is used to output summary statistics for Score for each Student.  The same is then done with the FSA package.

 

Remember to run only the blue code.  The red code is the (truncated) output R should produce.


Score = c(10, 9, 8, 7, 7, 8, 9, 10, 6, 5, 4, 9, 10, 9, 10)
Student = c("Bugs", "Bugs", "Bugs", "Bugs","Bugs",
            "Daffy", "Daffy", "Daffy", "Daffy", "Daffy",
            "Taz", "Taz", "Taz", "Taz", "Taz")

Data = data.frame(Student, Score)

Data


   Student Score
1     Bugs    10
2     Bugs     9
3     Bugs     8
4     Bugs     7
5     Bugs     7
6    Daffy     8
7    Daffy     9
8    Daffy    10
9    Daffy     6
10   Daffy     5
11     Taz     4
12     Taz     9
13     Taz    10
14     Taz     9
15     Taz    10


summary(Data)


  Student      Score      
 Bugs :5   Min.   : 4.000 
 Daffy:5   1st Qu.: 7.000 
 Taz  :5   Median : 9.000 
           Mean   : 8.067 
           3rd Qu.: 9.500 
           Max.   :10.000


if(!require(psych)){install.packages("psych")}

library(psych)

describeBy(x = Score,
           group = Student)


group: Bugs
  vars n mean  sd median trimmed  mad min max range skew kurtosis   se
1    1 5  8.2 1.3      8     8.2 1.48   7  10     3 0.26    -1.96 0.58
------------------------------------------------------------------------
group: Daffy
  vars n mean   sd median trimmed  mad min max range  skew kurtosis   se
1    1 5  7.6 2.07      8     7.6 2.97   5  10     5 -0.11    -2.03 0.93
------------------------------------------------------------------------
group: Taz
  vars n mean   sd median trimmed  mad min max range  skew kurtosis   se
1    1 5  8.4 2.51      9     8.4 1.48   4  10     6 -0.97    -1.04 1.12


if(!require(FSA)){install.packages("FSA")}

library(FSA)

Summarize(Score ~ Student,
          data=Data)


  Student n mean       sd min Q1 median Q3 max percZero
1    Bugs 5  8.2 1.303840   7  7      8  9  10        0
2   Daffy 5  7.6 2.073644   5  6      8  9  10        0
3     Taz 5  8.4 2.509980   4  9      9 10  10        0
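
If either installation fails, it may help to see where R is trying to install packages.  The following is a sketch using only base R functions; it is not part of the original tests.  It lists the library folders R uses, checks whether each appears writable, and checks whether psych and FSA are already available.  The object names LibPaths, Writable, and Installed are chosen here for illustration, and the write check may not be reliable on every system.

```r
# Library folders R searches; new packages go into the first one by default
LibPaths = .libPaths()

# file.access() returns 0 when the requested access is allowed
# (mode = 2 tests for write permission)
Writable = unname(file.access(LibPaths, mode = 2) == 0)

data.frame(Path = LibPaths, Writable = Writable)

# requireNamespace() checks whether a package is available without attaching it
Installed = sapply(c("psych", "FSA"), requireNamespace, quietly = TRUE)

Installed
```

If no folder is reported as writable, that would be consistent with not having permission to install packages in that environment.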


Required readings

 

The following readings are required for this chapter.  You can read them at the individual links below, or as chapters in the pdf version of An R Companion for the Handbook of Biological Statistics (rcompanion.org/documents/RCompanionBioStatistics.pdf).

 

About R

rcompanion.org/rcompanion/a_04.html

 

Obtaining R

rcompanion.org/rcompanion/a_05.html

 

A Few Notes to Get Started with R

rcompanion.org/rcompanion/a_06.html

 

Avoiding Pitfalls in R

rcompanion.org/rcompanion/a_07.html

 

Help with R

rcompanion.org/rcompanion/a_08.html

 

R Tutorials

rcompanion.org/rcompanion/a_09.html

 

References for this chapter

 

Mangiafico, S.S. 2015. An R Companion for the Handbook of Biological Statistics, version 1.09.

rcompanion.org/rcompanion/. (Pdf version: rcompanion.org/documents/RCompanionBioStatistics.pdf.)

 

Exercises A

 

1.  Install R and RStudio on your computer, or determine how you will access this software.  Be sure that you will be able to install additional R packages (as in “Tests for package installation” above).

 

If you haven't installed the FSA package, do so with the following command, or use Tools  >  Install packages  in RStudio.

 

if(!require(FSA)){install.packages("FSA")}

 

Use the following code to import from the internet a data frame of river stage measurements for Greenwich, NJ.

 

a. Summarize the stage measurements by year and report the results.

 

Greenwich = read.table("http://rcompanion.org/documents/Greenwich.csv",
                        header=TRUE, sep=",")

library(FSA)

Summarize(Stage ~ Year,
          data = Greenwich)

 

Examine the variables in the data frame.  (Don’t report anything.)


str(Greenwich)


library(psych)

headTail(Greenwich)

summary(Greenwich)

 

2.  Run the sample code snippets in the “Required readings” above, and in the “Tests for package installation” section above.  Try understanding the code.  Modify the data in the examples and examine the output.  You should be comfortable running short programs in R and examining the results.

The following will create a data frame called TwoCats with Pepé’s and Penelope’s scores from a talent show, and produce summary statistics for Penelope’s scores.

 

a. Report the results for Penelope.

 

Change Pepé’s scores to 9, 9, 7, 7, 6, 5.  Change the exclude option in the Summarize function to exclude only Penelope.  Penelope’s name will need to be in simple double quotes. 

 

b. Report the results for the summary statistics for Pepé’s new scores.  (The new mean for Pepé should be 7.1667.)


Input = ("
Cat            Score
'Pepé Le Pew'   8
'Pepé Le Pew'   8
'Pepé Le Pew'   8
'Pepé Le Pew'   8
'Pepé Le Pew'   8
'Pepé Le Pew'   8
Penelope       10
Penelope        9
Penelope        9
Penelope        8
Penelope        7
Penelope        7
")

TwoCats = read.table(textConnection(Input),header=TRUE)

library(FSA)

Summarize(Score ~ Cat,
          data = TwoCats,
          exclude = "Pepé Le Pew")

 

3.  The package FSA contains a data set called ChinookArg that has the lengths and weights of Chinook salmon at three locations.

 

library(FSA)

data(ChinookArg)

 

We can find some information about the data set with

 

?ChinookArg

 

a.  From what country are these data?

 

We can see the first and last rows of the data with

 

library(FSA)

headtail(ChinookArg)

 

Note that the variable tl is length, w is weight, and loc is location.  This is explained in the help file pulled up with

 

?ChinookArg

 

We can then get some summary statistics about the data.

 

library(FSA)

Summarize(tl ~ loc,
          data = ChinookArg)

 

This will provide summary statistics about length at each location.

 

In the output, n is the number of observations for that group.

 

This example uses formula notation, where tl is the measurement variable and loc is the grouping variable, and they are separated with a tilde.  We could also think of tl as the dependent variable and loc as the independent variable.
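
Formula notation is not specific to the FSA package.  As a small sketch, the base R function aggregate (not otherwise used in this chapter) accepts the same kind of formula.  The example reuses the Score and Student data from “Tests for package installation” above; the object name Means is chosen here for illustration.

```r
Score = c(10, 9, 8, 7, 7, 8, 9, 10, 6, 5, 4, 9, 10, 9, 10)
Student = c(rep("Bugs", 5), rep("Daffy", 5), rep("Taz", 5))

Data = data.frame(Student, Score)

# Score is the measurement variable and Student is the grouping variable,
# separated by a tilde
Means = aggregate(Score ~ Student,
                  data = Data,
                  FUN  = mean)

Means
```

The means should match those reported by describeBy and Summarize for the same data: 8.2, 7.6, and 8.4.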

 

Some functions accept formula notation and some do not.  Usually asking for help about the function will help you determine what input is required and what other arguments can be passed to the function, e.g.

 

?Summarize

 

Answer the following:

 

b.  What is the number of observations for Petrohue?

 

c.  What is the mean length for Chinook in Petrohue?

 

d.  The minimum length in Petrohue?

 

e.  The maximum length in Petrohue?

 

4.  Install the ggplot2 package with the following command, or use Tools  >  Install packages  in RStudio.

 

if(!require(ggplot2)){install.packages("ggplot2")}

 

The following code will plot Chinook length vs. weight.  It will add a smooth line to the plot.

 

library(ggplot2)

qplot(x    = w,
      y    = tl,
      data = ChinookArg,
      geom = c("point", "smooth"),
      xlab = "Weight (kg)",
      ylab = "Length (cm)",
      main = "Chinook plot by Sal")

 

a. In the code above, change "Sal" to your name.  Export the plot and embed it in your assignment.

 

In RStudio, in the Plot window, you can try the Export menu, and save as image, save as pdf, or copy to clipboard.

 

For assignments, probably the easiest thing is to use Export, save as image.  The recommended format is .png.  Pay attention to the directory where your image was saved.

 

For high resolution images, I tend to use the .pdf option and then edit the file with Photoshop or GIMP.

 

If you don't use RStudio, or would like to specify the size and resolution of image files, you can use

 

png(filename = "Rplot%03d.png",
    width  = 4,
    height = 3,
    units  = "in",
    res    = 600)

qplot(x    = w,
      y    = tl,
      data = ChinookArg,
      geom = c("point", "smooth"),
      xlab = "Weight (kg)",
      ylab = "Length (cm)",
      main = "Chinook plot by Sal")

 

dev.off()

 

You may use the following to see where the file was saved, if you didn’t specify a path in the filename argument.

 

getwd()
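
As an alternative sketch, the same kind of plot can be built with the ggplot function and saved with ggsave, both from the ggplot2 package.  Recent versions of ggplot2 mark qplot as deprecated, so this form may be useful if qplot produces a warning.  This is not the code used in the exercises; the object names ChinookPlot and OutFile, the file name, and the choice of a temporary folder are assumptions for illustration.

```r
library(FSA)       # provides the ChinookArg data set
library(ggplot2)

data(ChinookArg)

# Points plus a smooth line, with the same labels as the qplot example
ChinookPlot = ggplot(ChinookArg, aes(x = w, y = tl)) +
  geom_point() +
  geom_smooth() +
  labs(x     = "Weight (kg)",
       y     = "Length (cm)",
       title = "Chinook plot by Sal")

# Hypothetical output location; replace with your own path
OutFile = file.path(tempdir(), "ChinookPlot.png")

# Save at the same size and resolution as the png() example
ggsave(filename = OutFile,
       plot     = ChinookPlot,
       width    = 4,
       height   = 3,
       units    = "in",
       dpi      = 600)

OutFile
```

Printing OutFile shows where the image was written, which serves the same purpose as getwd() when no path is given.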



5.  In the qplot function above, switch the variable for weight, w, with the variable for location, loc.  Remove the whole line with the geom argument, and change the xlab argument to something appropriate.

 

a.  Export the plot and embed it in your assignment.

 

 

6.  There's another data set in the FSA package called WhitefishLC.

 

Using the command

 

?WhitefishLC

 

a.    What are the units for fish length in this data set?

 

 

Using this data set, summarize the length of whitefish (tl) by their age (scale1).

 

Use the code for Summarize that you used in 3.b.  Make sure you change the names of the variables in the function and the name of the data frame in the function.

 

b.  Report the results.

 

c.  Do you observe anything interesting in these results?

 

 

Plot length vs. age for whitefish.

 

Use the code you used in 4.  Be sure to change the names of the variables and of the data frame.

  

d.  Embed this plot with your assignment.  Include appropriate axis labels and be sure the units of length are correct.  If you include a title, be sure it is appropriate.

 

e. How would you describe the relationship of length and age for these fish?