
An R Companion for the Handbook of Biological Statistics

Salvatore S. Mangiafico

A Few Notes to Get Started with R

Packages used in this chapter

 

The following commands will install these packages if they are not already installed:


if(!require(dplyr)){install.packages("dplyr")}
if(!require(psych)){install.packages("psych")}

A cookbook approach

 

The examples in this book follow a “cookbook” approach as much as possible.  The reader should be able to modify the examples with her own data and change the options and variable names as needed.  This is more obvious with some examples than others, depending on the complexity of the code.

 

Color coding in this book

 

The text in blue in this book is R code that can be copied, pasted, and run in R.  The text in red is the expected result and should not be run.  In most cases I have truncated the results and included only the most relevant parts.  Comments are in green.  It is fine to run comments, but they have no effect on the results.

 

Copying and pasting code

 

From the website

Copying the R code pieces from the website version of this book should work flawlessly.  Code can be copied from the webpages and pasted into the R console, the RStudio console, the RStudio editor, or a plain text file.  All line breaks and formatting spaces should be preserved. 

 

The only issue you may encounter is that if you paste code into the RStudio editor, leading spaces may be added to some lines.  This is not usually a problem, but a way to avoid this is to paste the code into a plain text editor, save that file as a .R file, and open it from RStudio.

 

From the pdf

Copying the R code from the pdf version of this book may work less perfectly.  Formatting spaces and even line breaks may be lost.  Different pdf readers may behave differently. 

 

It may help to paste the copied code into a plain text editor to clean it up before pasting it into R or saving it as a .R file.  Also, if your pdf reader has a select tool that lets you select text in a rectangle, this works better in some readers.

 

A sample program

 

The following is an example of code for R that creates a vector called x and a vector called y, performs a correlation test between x and y, and then plots y vs. x.

 

This code can be copied and pasted into the console area of R or RStudio, or into the editor area of RStudio, and run.  You should get the output from the correlation test and the graphical output of the plot.

 

x = c(1,2,3,4,5,6,7,8,9)  # create a vector of values and call it x
y = c(9,7,8,6,7,5,4,3,1)

cor.test(x,y)             # perform correlation test

plot(x,y)                 # plot y vs. x

 

You can run fairly large chunks of code with R, though it is probably better to run smaller pieces, examining the output before proceeding to the next piece.

 

This kind of code can be saved as a file in the editor section of RStudio, or can be stored separately as a plain text file.  By convention, files for R code are saved as .R files.  These files can be opened and edited with either a plain text editor or the RStudio editor.

 

Assignment operators

 

In my examples I will use an equal sign, =, to assign a value to a variable.

 

height = 127.5

 

In examples you find elsewhere, you will more likely see a left arrow, <-, used as the assignment operator.

 

height <- 127.5

 

These are essentially equivalent, but I think the equal sign is more readable for a beginner.
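
As a small illustration of this point, the sketch below assigns the same value with each operator and checks that the results match.  The variable names height_eq, height_arrow, and vals are made up for this example.  The last lines show one difference worth knowing: inside a function call, the equal sign supplies an argument by name rather than creating a variable.

```r
height_eq = 127.5        # assignment with the equal sign
height_arrow <- 127.5    # assignment with the left arrow

identical(height_eq, height_arrow)   # TRUE, since both hold the same value

### Inside a function call, = supplies an argument by name
vals = c(1, 2, NA)
mean(vals, na.rm = TRUE)   # na.rm is an argument of mean(), not a new variable
```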

 

Comments

 

Comments are indicated with a number sign, #.  Comments are for human readers, and are not processed by R.
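
A brief sketch of how comments behave, using a made-up variable called total.  A comment can take up a whole line or follow code on the same line, and a line of code that has been commented out is not run.

```r
# A whole line can be a comment
total = 2 + 3    # a comment can also follow code on the same line
# total = 100    # this line is commented out, so it is not run
total
```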

 

Installing and loading packages

 

Some of the packages used in this book do not come with R automatically but need to be installed as add-on packages.  For example, if you wanted to use a function in the psych package to calculate the geometric mean of x in the sample program above:

 

x = c(1,2,3,4,5,6,7,8,9) 

 

First you would need to install the package psych:

 

install.packages("psych")

 

Then load the package:

 

library(psych)

 

You may then use the functions included in the package:

 

geometric.mean(x)


[1] 4.147166
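
If you want to check a result like this without an add-on package, one possible cross-check in base R uses the definition of the geometric mean as the exponential of the mean of the logged values.  This is only a sketch for positive values without missing data, and the name gm is hypothetical.

```r
x = c(1,2,3,4,5,6,7,8,9)

gm = exp(mean(log(x)))   # geometric mean: exponentiate the mean of the logs

gm
```

The value should agree with the result shown above to the digits displayed.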

 

In future sessions, you will need only to load the package; it should still be in the library from the initial installation.

 

If you see an error like the following, you may have misspelled the name of the package, or the package has not been installed.

 

library(psych)


Error in library(psych) : there is no package called ‘psych’
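
One way to check whether a package is installed before trying to load it is the requireNamespace function in base R, which returns TRUE or FALSE instead of stopping with an error.  The following is a minimal sketch; the variable name pkg is an assumption for illustration.

```r
pkg = "psych"    # package name to check; change as needed

if (requireNamespace(pkg, quietly = TRUE)) {
  message("Package ", pkg, " is installed and can be loaded with library().")
} else {
  message("Package ", pkg, " is not installed; try install.packages(\"", pkg, "\").")
}
```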

 

Data types

 

There are several data types in R.  Most commonly, the functions we are using will ask for input data to be a vector, a matrix, or a data frame.  Data types won’t be discussed extensively here, but the examples in this book will read the data as the appropriate data type for the selected analysis.
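
As a rough illustration of these three data types, the sketch below creates one of each and checks what it is.  The names v, m, and d are hypothetical.

```r
v = c(175, 176, 162, 165)                 # a numeric vector
m = matrix(c(1, 2, 3, 4), nrow = 2)       # a 2 x 2 matrix, filled by column
d = data.frame(Height = v)                # a data frame with one column

is.vector(v)        # TRUE
is.matrix(m)        # TRUE
is.data.frame(d)    # TRUE
```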

 

Creating data frames from a text string of data

 

For certain analyses you will want to select a variable from within a data frame.  In most examples using data frames, I’ll create the data frame from a text string that allows us to arrange the data in columns and rows, as we normally visualize data.

 

A data frame can be created with the read.table function.  Note that the text for the table is enclosed in straight (not curly) double quotes, and the whole call is enclosed in parentheses.  read.table is fairly tolerant of extra spaces and blank lines.  However, when a data frame is later converted to a matrix with as.matrix, as we will do in some examples, I’ve had errors caused by trailing spaces at the ends of lines.

 

Values in the table that contain spaces or special characters can be enclosed in straight single quotes (e.g. 'Spongebob & Patrick').

 

D1 = read.table(header=TRUE, stringsAsFactors=TRUE, text="
Gender  Height
male    175
male    176
female  162
female  165
")

D1


  Gender Height
1   male    175
2   male    176
3 female    162
4 female    165
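
The single-quote rule mentioned above can be sketched as follows.  The data frame D5, its column names, and its values are invented purely for illustration.

```r
D5 = read.table(header=TRUE, stringsAsFactors=TRUE, text="
Pair                   Count
'Spongebob & Patrick'  3
'Tom & Jerry'          5
")

D5

levels(D5$Pair)    # each quoted value is kept as a single level
```

Without the quotes, read.table would treat each space-separated word as a separate value, and the rows would not match the header.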


Reading data from a file

 

R can also read data from a separate file.  For longer data sets or complex analyses, it is helpful to keep data files and R code files separate.  For example,

 

D2 = read.table("GenderHeight.dat", header=TRUE, stringsAsFactors=TRUE)


would read in data from a file called GenderHeight.dat found in the working directory.  In this case the file could be a space-delimited text file:

 

Gender   Height
male     175
male     176
female   162
female   165

 

Or, with read.csv,

 

D2 = read.csv("GenderHeight.csv", header=TRUE, stringsAsFactors=TRUE)

 

for a comma-separated file.

 

Gender,Height
male,175
male,176
female,162
female,165

 

D2

 

  Gender Height
1   male    175
2   male    176
3 female    162
4 female    165
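
To try reading from a file without preparing one by hand, the following sketch writes the same comma-separated data to a temporary file and reads it back.  The names TempFile and D6 are hypothetical, and the temporary file location is chosen by R.

```r
### Write a small comma-separated file to a temporary location

TempFile = tempfile(fileext = ".csv")

writeLines(c("Gender,Height",
             "male,175",
             "male,176",
             "female,162",
             "female,165"),
           TempFile)


### Read it back in and check the structure

D6 = read.csv(TempFile, header=TRUE, stringsAsFactors=TRUE)

str(D6)
```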


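RStudio also has an easy interface to import data from a file.  It is called Import Dataset and is found in the Tools menu in older versions, or in the File menu and the Environment pane in more recent ones.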

 

The getwd function will show the location of the working directory, and setwd can be used to set the working directory.

 

getwd()

 

[1] "C:/Users/Salvatore/Documents"


setwd("C:/Users/Salvatore/Desktop")
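
Because setwd returns the previous working directory, it is possible to change directories temporarily and then go back.  The sketch below uses R's session temporary directory so that it runs on any computer; the name OldDir is an assumption for illustration.

```r
OldDir = setwd(tempdir())    # setwd returns the previous working directory

getwd()                      # now the session's temporary directory

setwd(OldDir)                # go back to where we started

getwd()
```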

 

Alternatively, file paths or URLs can be given directly in the read.table or read.csv function.

 

D3 = read.csv("https://rcompanion.org/documents/GenderHeight.csv",
               header=TRUE, stringsAsFactors=TRUE)

D3

 

  Gender Height
1   male    175
2   male    176
3 female    162
4 female    165


Variables within data frames

 

To look at just the variable Gender in the data frame D1 created above:

 

D1$Gender

 

[1] male   male   female female
Levels: female male

 

 

Note that D1$Height is a vector of numbers.

 

D1$Height

 

[1] 175 176 162 165

 

 

So if you wanted the mean for this variable:

 

mean(D1$Height)

 

[1] 169.5
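
Two related things you can do with a variable in a data frame are sketched below, using base R only.  The data frame Demo is a hypothetical copy of the example data, built here so the code runs on its own.

```r
Demo = data.frame(Gender = factor(c("male", "male", "female", "female")),
                  Height = c(175, 176, 162, 165))

Demo[["Height"]]                          # same vector as Demo$Height

tapply(Demo$Height, Demo$Gender, mean)    # mean Height for each level of Gender
```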

 

Using dplyr to create new variables in data frames

 

The standard method to define new variables in data frames is to use the data.frame$variable syntax.  So if we wanted to add a variable to the D1 data frame above that doubles Height:

 

D1$ Double = D1$ Height * 2      # Spaces are optional

D1

 

  Gender Height Double
1   male    175    350
2   male    176    352
3 female    162    324
4 female    165    330

 

Another method is to use the mutate function in the dplyr package:

 

library(dplyr)

D1 =
mutate(D1,
       Triple = Height*3,
       Quadruple = Height*4)

D1

 

  Gender Height Double Triple Quadruple
1   male    175    350    525       700
2   male    176    352    528       704
3 female    162    324    486       648
4 female    165    330    495       660

 

The dplyr package also has functions to select only certain columns in a data frame (select function) or to filter a data frame by the value of some variable (filter function).  It can be helpful for manipulating data frames.
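
A minimal sketch of the filter and select functions follows.  The data frame Demo and the names Tall and GenderOnly are hypothetical.  So that the code runs even where dplyr is not installed, it falls back to roughly equivalent base R indexing; that fallback is my assumption about a reasonable substitute, not part of dplyr.

```r
Demo = data.frame(Gender = factor(c("male", "male", "female", "female")),
                  Height = c(175, 176, 162, 165))

if (requireNamespace("dplyr", quietly = TRUE)) {
  Tall       = dplyr::filter(Demo, Height > 170)   # keep rows with Height > 170
  GenderOnly = dplyr::select(Demo, Gender)         # keep only the Gender column
} else {
  Tall       = Demo[Demo$Height > 170, ]           # base R: subset rows
  GenderOnly = Demo["Gender"]                      # base R: subset columns
}

Tall

GenderOnly
```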

 

In the examples in this book, I will use either the $ syntax or the mutate function in dplyr, depending on which I think makes the example more comprehensible.

 

Extracting elements from the output of a function

 

Sometimes it is useful to extract certain elements from the output of an analysis.  For example, we can assign the output from a binomial test to a variable we’ll call Test.

 

Test = binom.test(7, 12, 3/4,
                  alternative="less",
                  conf.level=0.95)

 

To see the value of Test:

 

Test

 

Exact binomial test

number of successes = 7, number of trials = 12, p-value = 0.1576

95 percent confidence interval:
 0.0000000 0.8189752


To see what elements are included in Test:

 

names(Test)

 

[1] "statistic"   "parameter"   "p.value"     "conf.int"    "estimate"    "null.value"  "alternative"
[8] "method"      "data.name"


Or with more details:

 

str(Test)

 

To view the p-value from Test:

 

Test$ p.value

 

[1] 0.1576437

 

 

To view the confidence interval from Test:

 

Test$ conf.int

 

[1] 0.0000000 0.8189752
attr(,"conf.level")
[1] 0.95

 

 

To view the upper confidence limit from Test:

 

Test$ conf.int[2]

 

[1] 0.8189752
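
The same approach works for the output of other tests.  As a sketch, the following extracts elements from the correlation test in the sample program earlier in this chapter.  The name CorTest is hypothetical.

```r
x = c(1,2,3,4,5,6,7,8,9)
y = c(9,7,8,6,7,5,4,3,1)

CorTest = cor.test(x, y)

names(CorTest)               # list the elements available

CorTest$estimate             # the correlation coefficient
CorTest$p.value              # the p-value
CorTest$conf.int[1:2]        # lower and upper confidence limits
```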

 

 

Exporting graphics

 

R has the ability to produce a variety of plots.  Simple plots can be produced with just a few lines of code.  These are useful to get a quick visualization of your data or to check on the distribution of residuals from an analysis.  More in-depth coding can produce publication-quality plots.

 

Exporting plots from the RStudio window

In the RStudio Plots window, there is an Export icon which can be used to save the plot as an image or pdf file.  A method I use is to export the plot as a pdf and then open this pdf with either Adobe Photoshop or the free alternative, GIMP (www.gimp.org/).  These programs allow you to import the pdf at whatever resolution you need, and then crop out extra white space.

 

The appearance of exported plots will change depending on the size and scale of the exported file.  If there are elements missing from a plot, it may be because the size is not ideal.  Changing the export size is also an easy way to adjust the size of the text of a plot relative to the other elements.

 

An additional trick in RStudio is to change the size of the plot window after the plot is produced, but before it is exported.  Sometimes this can get rid of problems where, for example, words in a plot legend are cut off.

 

Finally, if you export a plot as a pdf, but still need to edit it further, you can open it in Inkscape, ungroup the plot elements, adjust some plot elements, and then export as a high-resolution bitmap image.  Just be sure you don’t change anything important, like how the data line up with the axes.

 

Exporting plots directly as a file

R also allows for the direct exporting of graphics as a .bmp, .jpg, .png, or .tif file.  See ?png for details.  This method allows you to specify the dimensions and resolution of the output image.

 

Note that dev.off() is used afterwards to close the file device.  This finishes writing the image file and returns future graphical output to its usual channel.

 

### Optional code to set the directory where the image will be saved

setwd("C:/Users/Salvatore/Desktop")


### Create data frame

D4 = read.table(header=TRUE, stringsAsFactors=TRUE, text="
TolkienRace AvgHeight
Dwarf       130
Hobbit      105
Man         165
Elf         170
Orc         125
")


### Output a plot as a .png file

png(filename = "TolkienPlot.png",
    width  = 5,
    height = 3.75,
    units  = "in",
    res    = 300)
    
barplot(AvgHeight ~ TolkienRace, data=D4)

dev.off()
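
The same open-plot-close pattern applies to other file devices.  As a sketch, the following uses the pdf function, which is part of base R, and writes to R's session temporary directory so that it runs on any computer.  The name OutFile and the output location are assumptions for illustration, and the bar heights are passed as a plain vector with names.arg rather than with the formula used above.

```r
Race      = c("Dwarf", "Hobbit", "Man", "Elf", "Orc")
AvgHeight = c(130, 105, 165, 170, 125)

OutFile = file.path(tempdir(), "TolkienPlot.pdf")   # hypothetical output location

pdf(file = OutFile, width = 5, height = 3.75)   # width and height are in inches

barplot(AvgHeight, names.arg = Race)

dev.off()               # close the device so the file is written

file.exists(OutFile)    # TRUE if the file was created
```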