R Handbook: Using R

R and RStudio

This book will use the software package R Project for Statistical Computing to create plots and conduct statistical analyses. It is free to install on a Windows, Mac, or Linux computer. Although it is not required, I also recommend using RStudio, which is also free.

Using the RStudio environment

RStudio provides a nice work environment because it presents several windows on the screen that make it easy to view code, results, and plots at once.

Program code can be worked on in the upper left Script window. If that window isn’t displayed, it can be opened with File > New File > R Script. To run code, select it in the Script window and click the Run button. The results are reported in the Console window on the lower left. Code can be saved as an .r or .R file, and those files should subsequently open automatically with RStudio.

Output can be copied from the Console window as text.

Smaller pieces of code can be typed or pasted directly in the Console window.

The lower right window will usually show either plot results or help results. Plots can be saved as png or jpg files.

Using the R Console environment

If you are not using RStudio, it is advisable to keep code stored in a separate text file, and then code can be pasted as small chunks directly into the R GUI Console. Results are produced in the Console as code is entered, and plots will open in a separate window.

Installing R and RStudio

In theory it should be easy to install R and RStudio on your computer. Both work with Windows, Mac, and Linux operating systems.

First install R.

cran.r-project.org/

Then install RStudio. Choose the free desktop version.

posit.co/download/rstudio-desktop/

You may need to tell RStudio where your R installation is. In RStudio this can be changed with:

Tools > Global Options

From there, you simply start RStudio when you want to use R.

Optional: Problems with library folder location in Windows 10

I had some problems with R on Windows 10 when installing additional packages. The issue is that by default R wants to place installed packages (“library” folder) in the same location where R resides. The problem is that Windows doesn’t allow programs to install things in the Program Files folder.

One option is to install R somewhere else on the computer. That is, not in the Program Files folder.

A second option is a temporary fix. With R installed in Program Files, run the following code first whenever you start an R session. You will need to change the path indicated.

.libPaths("C:/Users/Mangiafico/R/library")
.libPaths()

The first path listed should be the path you added.
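One thing to watch for with this temporary fix: .libPaths() silently ignores any folder that does not exist yet. The following is a minimal sketch, not part of the original instructions, that creates the folder first and then places it at the front of the library paths. The helper name use_library and the throwaway folder under tempdir() are made up for illustration; you would substitute your own path.

```r
# Hypothetical helper: create the folder if needed,
# then put it first in the list of library paths.
use_library = function(path) {
  if (!dir.exists(path)) {
    dir.create(path, recursive = TRUE)
  }
  .libPaths(c(path, .libPaths()))
  invisible(.libPaths())
}

# Demonstration with a temporary folder.
# Replace this with your own location, e.g. "C:/Users/YourName/R/library"
demo_lib = file.path(tempdir(), "R", "library")

use_library(demo_lib)

# The first path listed should now be the demonstration folder
.libPaths()
```

Because tempdir() is cleared when R closes, this exact folder is only useful for trying the idea out, not for keeping packages.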

A better fix. With R installed in Program Files, you can tell R where to install packages (“library”). Here I’ll try to recount what I did that worked.

a) Right click on the shortcut icon that you use for RStudio. Properties > Start In. Change it to e.g. C:\Users\Mangiafico\R

b) Change RStudio > Tools > Global Options > Default Working Directory. To e.g. C:\Users\Mangiafico\R , or whatever you used in a).

c) Run the following code. It will tell you where there are Rprofile and Rprofile.site files. Delete those files, or move them to somewhere harmless.

candidates <- c( Sys.getenv("R_PROFILE"),
                 file.path(Sys.getenv("R_HOME"), "etc", "Rprofile.site"),
                 Sys.getenv("R_PROFILE_USER"), file.path(getwd(), ".Rprofile") )

Filter(file.exists, candidates)

d) Download my .Rprofile file. Open it with a text editor (Right click > Open with > Notepad). Note that it doesn’t have a file extension (i.e. it doesn’t have a .txt extension). Change the path in there to where you want your library folder to be. Include "library" in the path.

rcompanion.org/documents/rprofile.html

Save it and move it to the directory you used for a) and b).

Do the same with the Rprofile.site file. Save this one in Program Files > R > (Version) > etc , or wherever the Rprofile.site file was listed in c).

You can modify these files if you wish. It’s important that you don’t add file extensions to their names. The .First function lists commands to run when R starts up. I just have it report what is the R home directory, working directory, and where the library folder is stored.

e) Exit out of R and RStudio. Start RStudio as you normally would.

Enter the following in the console.

.libPaths()

And the first path listed should be the path you added to the .Rprofile file.
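For orientation, an .Rprofile along the lines described in step d) might look roughly like the sketch below. This is not the downloadable file itself, only a guess at its general shape. The path is a placeholder, and the variable name my_library is hypothetical.

```r
# Sketch of a possible .Rprofile (assumed layout, not the original file)

# Placeholder path: change to where you want your library folder to be
my_library = "C:/Users/YourName/R/library"

# Only add the folder if it already exists; otherwise R would ignore it anyway
if (dir.exists(my_library)) {
  .libPaths(c(my_library, .libPaths()))
}

# .First runs when R starts up; here it just reports a few locations
.First = function() {
  cat("R home directory: ", R.home(), "\n")
  cat("Working directory:", getwd(), "\n")
  cat("Library paths:\n")
  cat(paste0("  ", .libPaths()), sep = "\n")
}
```

If you paste this into the Console instead of saving it as a file, you can call .First() by hand to see what it would report at startup.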

If these don’t work, you may need to do some googling and find a solution. If you find a good solution, let the class know.

What if I don’t have my own computer?

If you don’t have your own computer on which to install the software, there are a few options.

Portable installation on a usb drive

One solution is to install R Portable on a portable usb drive. You can then run this software on university or other computers directly from that drive, if the computer is set up to give you permission to do so. You should check to see if you will have permission to run portable software from a usb drive on these computers.

Using university computers

R and RStudio are installed on Rutgers University computers in computer laboratories. You should check with the individual computer lab, and make sure you will have permission to install additional R packages on those machines.

Installing R packages on a university computer

If there are issues with installing additional R packages on university computers, perhaps the easiest solution would be to always change the location of the library folder at the start of each session. For example, if you were using a usb flash drive assigned as the D: drive:

### Set location of library folder
.libPaths("D:/R/library")

### Set location of working directory
setwd("D:/R")

### Check these
.libPaths()
getwd()

The first path listed for .libPaths() should be the directory you requested. If it is available, it is where R will install new packages.

These commands should be run at the start of every session.
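Before trying to install anything, it can help to ask R whether the first library folder looks writable. The sketch below uses the base function file.access(), which returns 0 when the requested access appears to be allowed. Treat the answer as a hint only; on some systems, Windows in particular, it may not reflect the real permissions. The variable names are arbitrary.

```r
# Folder where R would try to install new packages
first_lib = .libPaths()[1]

# mode = 2 asks about write permission; 0 means "appears allowed"
looks_writable = file.access(first_lib, mode = 2) == 0

cat("First library folder:", first_lib, "\n")
cat("Appears writable:    ", unname(looks_writable), "\n")
```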

Using R online

There are websites on which you can run R in an online environment. At the time of writing, I like the rdrr.io/snippets site because it has common packages installed. There, you can paste code into the window, and press the Run button.

Tests for package installation

If you are unsure if you can install additional R packages in the environment you are working in, try the two examples below.

The psych and FSA packages may take a while for their initial installation. The code for you to run is in blue, and the output is in red. The output is truncated here.

Don’t worry too much about what the code is doing at this point. The main point is to see if you can get output from both the psych and FSA packages.

The code assigns a vector of numbers to Score, and a vector of text strings to Student. It then combines those two into a data frame called Data, which is then printed. The summary function counts the values in Student, and determines the median and other statistics for Score. The psych package is installed, then loaded with the library function, and then is used to output summary statistics for Score for each Student. The same is then done with the FSA package.

Remember to run only the blue code. The red code is the (truncated) output R should produce.

Score = c(10, 9, 8, 7, 7, 8, 9, 10, 6, 5, 4, 9, 10, 9, 10)
Student = c("Bugs", "Bugs", "Bugs", "Bugs","Bugs",
"Daffy", "Daffy", "Daffy", "Daffy", "Daffy",
"Taz", "Taz", "Taz", "Taz", "Taz")

Data = data.frame(Student, Score, stringsAsFactors=TRUE)

Data

   Student Score
1     Bugs    10
2     Bugs     9
3     Bugs     8
4     Bugs     7
5     Bugs     7
6    Daffy     8
7    Daffy     9
8    Daffy    10
9    Daffy     6
10   Daffy     5
11     Taz     4
12     Taz     9
13     Taz    10
14     Taz     9
15     Taz    10

summary(Data)

  Student      Score
 Bugs :5   Min.   : 4.000
 Daffy:5   1st Qu.: 7.000
 Taz  :5   Median : 9.000
           Mean   : 8.067
           3rd Qu.: 9.500
           Max.   :10.000

if(!require(psych)){install.packages("psych")}

library(psych)

describeBy(x = Score,
group = Student)

group: Bugs
  vars n mean  sd median trimmed  mad min max range skew kurtosis   se
1    1 5  8.2 1.3      8     8.2 1.48   7  10     3 0.26    -1.96 0.58
------------------------------------------------------------------------
group: Daffy
  vars n mean   sd median trimmed  mad min max range  skew kurtosis   se
1    1 5  7.6 2.07      8     7.6 2.97   5  10     5 -0.11    -2.03 0.93
------------------------------------------------------------------------
group: Taz
  vars n mean   sd median trimmed  mad min max range  skew kurtosis   se
1    1 5  8.4 2.51      9     8.4 1.48   4  10     6 -0.97    -1.04 1.12

if(!require(FSA)){install.packages("FSA")}

library(FSA)

Summarize(Score ~ Student,
data=Data)

  Student n mean       sd min Q1 median Q3 max percZero
1    Bugs 5  8.2 1.303840   7  7      8  9  10        0
2   Daffy 5  7.6 2.073644   5  6      8  9  10        0
3     Taz 5  8.4 2.509980   4  9      9 10  10        0
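If you only want to know whether the two packages ended up installed, without re-running the whole example, a quick check like the following may be enough. It is a small sketch using the base function requireNamespace(), which reports TRUE or FALSE for each package without attaching it. The object names pkgs and installed are arbitrary.

```r
# Packages used in the tests above
pkgs = c("psych", "FSA")

# TRUE means the package can be found and loaded; FALSE means it cannot
installed = sapply(pkgs, requireNamespace, quietly = TRUE)

installed
```

A FALSE for either package suggests the installation step did not succeed in the environment you are working in.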

Entering data frames in R

In R, a data set can be entered in different ways.

First, we’ll load the external package FSA, which makes it easy to summarize data.

if(!require(FSA)){install.packages("FSA")}

Combining vectors into a data frame

One simple way to construct a data frame in R is to create vectors for individual variables and then combine them into a data frame. One thing to note here is that the variables generally need to all have the same length. For example, here both Dog and CuteScore have 12 observations.

Dog = c("Kelly", "Kelly", "Kelly", "Kelly",
"Mr. Bruce", "Mr. Bruce", "Mr. Bruce", "Mr. Bruce",
"Daisy", "Daisy", "Daisy", "Daisy")

CuteScore = c(9, 10, 8, 9, 8, 9, 10, 10, 9, 9, 10, 10)

MyDogs1 = data.frame(Dog, CuteScore)

str(MyDogs1)

'data.frame': 12 obs. of 2 variables:
$ Dog : chr "Kelly" "Kelly" "Kelly" "Kelly" ...
$ CuteScore: num 9 10 8 9 8 9 10 10 9 9 ...

summary(MyDogs1)

     Dog              CuteScore
 Length:12          Min.   : 8.00
 Class :character   1st Qu.: 9.00
 Mode  :character   Median : 9.00
                    Mean   : 9.25
                    3rd Qu.:10.00
                    Max.   :10.00

library(FSA)

Summarize(CuteScore ~ Dog, data = MyDogs1)

        Dog n mean        sd min   Q1 median    Q3 max
1     Daisy 4 9.50 0.5773503   9 9.00    9.5 10.00  10
2     Kelly 4 9.00 0.8164966   8 8.75    9.0  9.25  10
3 Mr. Bruce 4 9.25 0.9574271   8 8.75    9.5 10.00  10
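To see why the vectors need to have matching lengths, the sketch below deliberately leaves one score out and catches the resulting error instead of stopping. The vectors DogShort and ScoreShort are made-up illustrations, not data from this chapter.

```r
# Three names but only two scores, on purpose
DogShort   = c("Kelly", "Kelly", "Mr. Bruce")
ScoreShort = c(9, 10)

# Comparing lengths first is a cheap way to catch the problem
length(DogShort) == length(ScoreShort)

# data.frame() refuses to combine them; tryCatch() keeps the error message
Attempt = tryCatch(data.frame(DogShort, ScoreShort),
                   error = function(e) conditionMessage(e))

Attempt
```

The word "generally" matters because R will recycle a shorter vector when its length divides evenly into the longer one, which can hide a data entry mistake rather than report it.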

Using rep() to create a vector of repeated values

A shortcut to create a vector of repeated values is to use the rep() function.

Dog = c( rep("Kelly", 4), rep("Mr. Bruce", 4), rep("Daisy", 4) )

Dog

"Kelly" "Kelly" "Kelly" "Kelly" "Mr. Bruce" "Mr. Bruce"
"Mr. Bruce" "Mr. Bruce" "Daisy" "Daisy" "Daisy" "Daisy"

This could also be expressed as

Dog = rep(c("Kelly", "Mr. Bruce", "Daisy"), each = 4)

Dog

"Kelly" "Kelly" "Kelly" "Kelly" "Mr. Bruce" "Mr. Bruce"
"Mr. Bruce" "Mr. Bruce" "Daisy" "Daisy" "Daisy" "Daisy"
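When the groups are not all the same size, the times argument of rep() can take one count per value. The sketch below is an illustration with group sizes of 5 and 7; the object name UnevenDogs is arbitrary.

```r
# Repeat the first name 5 times and the second name 7 times
UnevenDogs = rep(c("Scooby", "Scrappy"), times = c(5, 7))

UnevenDogs

# Count how many times each name appears
table(UnevenDogs)
```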

Entering data as a space-separated table

Another simple way to enter data into a data frame is to use read.table() with space-separated data.

Note that, here, you need single quotes inside the text if a value contains a space, as Mr. Bruce does.

MyDogs2 = read.table(header=TRUE, stringsAsFactors=TRUE, text="
Dog   CuteScore
Kelly        9
Kelly       10
Kelly        8
Kelly        9
'Mr. Bruce' 8
'Mr. Bruce' 9
'Mr. Bruce' 10
'Mr. Bruce' 10
Daisy        9
Daisy        9
Daisy       10
Daisy       10
")

str(MyDogs2)

summary(MyDogs2)

library(FSA)

Summarize(CuteScore ~ Dog, data = MyDogs2)

Entering data as a comma-separated table

You can also use read.table() with comma-separated data.

Note that, here, you don’t need quotes inside the text for Mr. Bruce.

MyDogs3 = read.table(header=TRUE, stringsAsFactors=TRUE, sep=",", text="
Dog,   CuteScore
Kelly,       9
Kelly,      10
Kelly,       8
Kelly,       9
Mr. Bruce,   8
Mr. Bruce,   9
Mr. Bruce, 10
Mr. Bruce, 10
Daisy,       9
Daisy,       9
Daisy,      10
Daisy,      10
")

str(MyDogs3)

summary(MyDogs3)

library(FSA)

Summarize(CuteScore ~ Dog, data = MyDogs3)
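One caveat with comma-separated text: spaces next to a text value are kept as part of the value unless you ask read.table() to remove them. The sketch below uses the strip.white option for that. The small data set and the name PaddedDogs are made up for illustration, with extra spaces added after Kelly on purpose.

```r
# Note the extra spaces between "Kelly" and the comma
PaddedDogs = read.table(header=TRUE, sep=",", strip.white=TRUE, text="
Dog,       CuteScore
Kelly    , 9
Mr. Bruce, 8
")

# With strip.white=TRUE the padding is removed from the names
as.character(PaddedDogs$Dog)
```

Without strip.white=TRUE, the first name would be stored with its trailing spaces, which could make it look like a different dog from a plain "Kelly" elsewhere in the data.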

Reading a .csv file

In real-world projects, one of the most common ways to import data into R is to use an external .csv file. The read.csv() function has a variety of options to help import .csv files that may have different formats. You can see these options with

?read.csv

Reading a .csv file

The following code imports a .csv file from the internet. However, the path in quotes could be a file name path on the local computer instead.

MyDogs = read.csv("https://rcompanion.org/documents/MyDogs.csv", header=TRUE)

str(MyDogs)

'data.frame': 12 obs. of 2 variables:
$ Dog : chr "Kelly" "Kelly" "Kelly" "Kelly"
$ CuteScore: num 9 10 8 9 8 9 10 10 9 9

summary(MyDogs)

     Dog              CuteScore
 Length:12          Min.   : 8.00
 Class :character   1st Qu.: 9.00
 Mode  :character   Median : 9.00
                    Mean   : 9.25
                    3rd Qu.:10.00
                    Max.   :10.00

library(FSA)

Summarize(CuteScore ~ Dog, data = MyDogs)
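To try the same thing with a file on the local computer, the sketch below first writes a tiny .csv file to a temporary folder and then reads it back. The file name MyDogsLocal.csv and its three rows are made up for illustration; with your own data you would skip the writing step and give read.csv() the path to your file.

```r
# Build a path in R's temporary folder (a stand-in for your own folder)
csv_path = file.path(tempdir(), "MyDogsLocal.csv")

# Write a small example file: a header line plus three rows
writeLines(c("Dog,CuteScore",
             "Kelly,9",
             "Mr. Bruce,8",
             "Daisy,10"),
           csv_path)

# Read it the same way as the internet example, but with a local path
MyDogsLocal = read.csv(csv_path, header=TRUE)

str(MyDogsLocal)
```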

Setting the working directory

A useful trick is to set the working directory to a folder containing the R code and .csv files for a project, so that the full path of the .csv file doesn’t need to be specified when it’s imported.

getwd()

[1] "C:/Users/Sal Mangiafico/Documents"

setwd("C:/Users/Sal Mangiafico/Desktop")

Another trick I often use is to start RStudio by double-clicking on the .r file within the correct folder. This sets the working directory as the folder with the .r file.
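As a small sketch of how the working directory shortens file paths, the code below switches to a folder, reads a file by its name alone, and then switches back. The temporary folder and the file example.csv are stand-ins made up for illustration.

```r
# Remember where we started so we can return afterwards
old_wd = getwd()

# Stand-in for your project folder
project_dir = tempdir()
setwd(project_dir)

# Create a tiny file in the project folder
writeLines(c("Dog,CuteScore", "Kelly,9"), "example.csv")

# No folder is needed in the path, because R looks in the working directory
Example = read.csv("example.csv")

Example

# Return to the original working directory
setwd(old_wd)
```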

Reading tab-delimited data

R can also read tab-delimited data. Sometimes .csv files are tab-delimited, and sometimes tab-delimited data are stored in .txt, .tab, or .tsv files.

The representation of tabs in R or RStudio is sometimes inconsistent, so I don’t recommend using tabs to separate data within an R script.

MyDogs4 = read.table("https://rcompanion.org/documents/MyDogsTabs.txt", header=TRUE, stringsAsFactors=TRUE, sep="\t")

str(MyDogs4)

summary(MyDogs4)

library(FSA)

Summarize(CuteScore ~ Dog, data = MyDogs4)
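If you want to experiment with tab-delimited data without relying on how tabs are displayed, one option is to write the tab explicitly as "\t". The sketch below does this with a small made-up file and reads it with read.delim(), a base R function whose default separator is a tab. The file name and contents are illustrative only.

```r
# Path in R's temporary folder (a stand-in for your own folder)
tab_path = file.path(tempdir(), "MyDogsTabsLocal.txt")

# "\t" is an explicit tab character between the two columns
writeLines(c("Dog\tCuteScore",
             "Kelly\t9",
             "Mr. Bruce\t8"),
           tab_path)

# read.delim() assumes a header line and tab-separated values
MyDogsTabsLocal = read.delim(tab_path, stringsAsFactors=TRUE)

str(MyDogsTabsLocal)
```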

Importing other file types

There are also packages to import Excel files and even Word files. I don’t do this often, so I won’t recommend any specific packages.

Personally, I find it a good idea to wrangle the data into a simple .csv format before importing to R. If necessary, .csv files can be edited in a text editor, word processor like Word, or a spreadsheet like Excel. This is helpful if there are any unusual values that need to be cleaned up or if you need to use search and replace functionality.

Required readings

The following readings are required for this chapter. You can read them at the individual links below, or as chapters in the pdf version of the R Companion to the Handbook of Biological Statistics (rcompanion.org/documents/RCompanionBioStatistics.pdf).

References

Mangiafico, S.S. 2015. An R Companion for the Handbook of Biological Statistics, version 1.09.

rcompanion.org/rcompanion/. (Pdf version: rcompanion.org/documents/RCompanionBioStatistics.pdf.)

Exercises A

1. Install R and RStudio on your computer, or determine how you will access this software. Be sure that you will be able to install additional R packages (as in “Tests for package installation” above).

If you haven't installed the FSA and psych packages, do so with the following command, or use Tools > Install packages in RStudio.

if(!require(FSA)){install.packages("FSA")}
if(!require(psych)){install.packages("psych")}

Use the following code to import from the internet a data frame of river stage measurements for Greenwich, NJ.

Greenwich = read.table("https://rcompanion.org/documents/Greenwich.csv",
header=TRUE, stringsAsFactors=TRUE, sep=",")

a. Summarize the stage measurements by year and report the results.

library(FSA)

Summarize(Stage ~ Year,
data = Greenwich)

Examine the variables in the data frame. (Don’t report anything.)

str(Greenwich)

library(psych)

headTail(Greenwich)

summary(Greenwich)

The str function reports the type of each variable in the data frame. Agency is treated as a factor variable by R. We’ll consider this a nominal variable. It has one level, which is “USGS”. The rest of the variables are being treated as integer or numeric variables.

The function headTail in the psych package shows you the top and bottom of the data frame, arranged in the usual way with variables in columns and observations in rows.

The summary function reports some summary statistics for the data frame. In this case, it reports that there are 171 instances of “USGS” for the variable Agency. Since it treats Year as an integer variable, it tells you that the minimum is 2000 and the maximum is 2014. It tells us that there are 6 NA values in the variable Stage.

These functions are useful to see what variables are in a data frame, and to be sure that the data frame reflects the data we think it should. For example, this data frame has 171 observations. If we were expecting 1000 observations, we would know something went wrong in our data entry or importing of the data.

We could create a new variable called Year.f that would be the same as Year, but treated as a factor variable.

Greenwich$Year.f = as.factor(Greenwich$Year)

Now when we use the summary function, it will report the counts for some levels of Year.f.

summary(Greenwich)

b. In future chapters, we will learn to summarize data in more specific ways. In this case, if we want the counts for all the years in Year.f, we can do the following. Report this result.

xtabs(~ Year.f,
data = Greenwich)
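The difference between a numeric variable and a factor variable is easier to see on a tiny example. The sketch below uses a made-up data frame called ToyYears, not the Greenwich data, to compare what summary() reports for each type and what xtabs() returns.

```r
# Made-up years for illustration
ToyYears = data.frame(Year = c(2000, 2000, 2001, 2002, 2002, 2002))

# As a number: minimum, quartiles, mean, and maximum
summary(ToyYears$Year)

# As a factor: a count for each level
ToyYears$Year.f = as.factor(ToyYears$Year)

summary(ToyYears$Year.f)

# xtabs gives the same counts as a table
YearCounts = xtabs(~ Year.f,
                   data = ToyYears)

YearCounts
```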

2. The following will create a data frame called TwoCats with Leo Verdura and Katya’s scores from a talent show, and produce summary statistics for Katya’s scores.

TwoCats = read.table(header=TRUE, stringsAsFactors=TRUE, text="

Cat            Score
'Leo Verdura'   8
'Leo Verdura'   8
'Leo Verdura'   8
'Leo Verdura'   8
'Leo Verdura'   8
'Leo Verdura'   8
Katya          10
Katya           9
Katya           9
Katya           8
Katya           7
Katya           7
")

library(FSA)

Summarize(Score ~ Cat,
          data = TwoCats,
          exclude = "Leo Verdura")

a. Report the results for Katya.

Change Leo’s scores to 9, 9, 7, 7, 6, 5. Change the exclude option in the Summarize function to exclude only Katya. Katya’s name will need to be in simple double quotes.

b. Report the results for the summary statistics for Leo’s new scores. (The new mean for Leo should be 7.1667.)

c. Summarize the data below with the Summarize function in the FSA package. Report the results.

To do this,

• Copy the data below and paste it into the code above. Make sure you retain the read.table line and the ") line.

Dog      Snacks
Scooby    4
Scooby    3
Scooby    6
Scooby   18
Scooby    7
Scrappy   8
Scrappy 10
Scrappy   6
Scrappy   5
Scrappy 15
Scrappy   7
Scrappy   9

• Get rid of the exclude option in the Summarize call. Pay attention to the placement of the commas and parentheses, so it will look like this:

Summarize(Score ~ Cat,
data = TwoCats)

• In the code, change TwoCats to TwoDogs.

• In the Summarize function call, change TwoCats to TwoDogs, change Score to Snacks, change Cat to Dog.

You should get summary statistics for both Scooby and Scrappy in one output. The number of observations (n) for Scooby should be 5 and for Scrappy should be 7. The mean snacks for Scooby should be 7.6 and the mean snacks for Scrappy should be 8.57.

Exercises Alpha

1. The package FSA contains a data set called ChinookArg that has the lengths and weights of Chinook salmon at three locations.

library(FSA)

data(ChinookArg)

We can find some information about the data set with

?ChinookArg

a. Report: From what country are these data?

We can see the first and last rows of the data with

library(psych)

headTail(ChinookArg)

Note that the variable tl is length, w is weight, and loc is location. This is explained in the help file pulled up with

?ChinookArg

We can then get some summary statistics about the data.

library(FSA)

Summarize(tl ~ loc,
data = ChinookArg)

This will provide summary statistics about length at each location.

In the output, n is the number of observations for that group. The other summary statistics are mean, standard deviation, minimum, 1st quartile, median, 3rd quartile, and maximum. These statistics will be discussed in a later chapter.

This example uses formula notation, where tl is the measurement variable and loc is the grouping variable, and they are separated with a tilde. We could also think of tl as the dependent variable and loc as the independent variable.

Some functions accept formula notation and some do not. Usually asking for help about the function will help you determine what input is required and what other arguments can be passed to the function, e.g.

?Summarize
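Formula notation is not specific to the FSA package. As a sketch of the same idea with a base R function, the code below uses aggregate() to get the mean of a measurement variable for each group. The data frame ToyFish and its values are made up for illustration.

```r
# Made-up lengths at two sites
ToyFish = data.frame(Length = c(10, 12, 14, 20, 22),
                     Site   = c("A", "A", "A", "B", "B"))

# Measurement variable on the left of the tilde, grouping variable on the right
SiteMeans = aggregate(Length ~ Site,
                      data = ToyFish,
                      FUN  = mean)

SiteMeans
```

Here the formula Length ~ Site plays the same role as tl ~ loc in the Summarize call.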

Answer the following:

b. What is the number of observations for Petrohue?

c. What is the mean length for Chinook in Petrohue?

d. The minimum length in Petrohue?

e. The maximum length in Petrohue?

2. Install the ggplot2 package with the following command, or use Tools > Install packages in RStudio.

if(!require(ggplot2)){install.packages("ggplot2")}

The following code will plot Chinook length vs. weight. It will add a smooth line to the plot.

library(ggplot2)

qplot(x    = w,
      y    = tl,
      data = ChinookArg,
      geom = c("point", "smooth"),
      xlab = "Weight (kg)",
      ylab = "Length (cm)",
      main = "Chinook plot by Sal")

a. In the code above, change "Sal" to your name. Export the plot and embed it in your assignment.

In RStudio, in the Plot window, you can try the Export menu, and save as image, save as pdf, or copy to clipboard.

For assignments, probably the easiest thing is to use Export, save as image. The recommended format is .png. Pay attention to the directory where your image was saved.

For high resolution images, I tend to use the .pdf option and then edit the file with Photoshop or GIMP.

If you don't use RStudio, or would like to specify the size and resolution of image files, you can use

png(filename = "Rplot03d.png",
    width    = 4,
    height   = 3,
    units    = "in",
    res      = 600)

qplot(x    = w,
      y    = tl,
      data = ChinookArg,
      geom = c("point", "smooth"),
      xlab = "Weight (kg)",
      ylab = "Length (cm)",
      main = "Chinook plot by Sal")

dev.off()

You may use the following to see where the file was saved, if you didn’t specify a path in the filename argument.

getwd()
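Recent releases of ggplot2 mark qplot() as deprecated, so you may see a warning when running the code above. As an alternative, the sketch below builds what should be an equivalent plot with the ggplot() function. It assumes the FSA and ggplot2 packages can be installed in your environment, and the object name ChinookPlot is arbitrary.

```r
if(!require(ggplot2)){install.packages("ggplot2")}
if(!require(FSA)){install.packages("FSA")}

library(ggplot2)
library(FSA)

data(ChinookArg)

# Points plus a smooth line, with the same labels as the qplot example
ChinookPlot = ggplot(ChinookArg, aes(x = w, y = tl)) +
  geom_point() +
  geom_smooth() +
  labs(x     = "Weight (kg)",
       y     = "Length (cm)",
       title = "Chinook plot by Sal")

ChinookPlot
```

Storing the plot in an object first makes it easy to display it again later or to pass it to other functions.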

3. In the qplot function above, switch the variable for weight, w, with the variable for location, loc. Remove the whole line with the geom argument. And change the xlab argument to something appropriate.

a. Export the plot and embed it in your assignment.

4. There's another data set in the FSA package called WhitefishLC.

Using the command

?WhitefishLC

Answer the following lettered questions.

a. What are the units for fish length in this data set?

Using this data set, summarize the length of whitefish (tl) by their age (scale1).

Use the code for Summarize that you used in 1.a. Make sure you change the names of the variables in the function and the name of the data frame in the function.

b. Report the results.

c. Do you observe anything interesting in these results?

Plot length vs. age for whitefish.

Use the code you used in 2. Be sure to change the name of the variables and of the data frame.

d. Embed this plot with your assignment. Include appropriate axis labels and be sure the units of length are correct. If you include a title, be sure it is appropriate.

e. How would you describe the relationship of length and age for these fish?

Summary and Analysis of Extension Program Evaluation in R
