
Summary and Analysis of Extension Program Evaluation in R

Salvatore S. Mangiafico

Descriptive Statistics

Packages used in this chapter

 

The packages used in this chapter include:

•  psych

•  DescTools

•  Rmisc

•  FSA

•  plyr

•  boot

 

The following commands will install these packages if they are not already installed:


if(!require(psych)){install.packages("psych")}
if(!require(DescTools)){install.packages("DescTools")}
if(!require(Rmisc)){install.packages("Rmisc")}
if(!require(FSA)){install.packages("FSA")}
if(!require(plyr)){install.packages("plyr")}
if(!require(boot)){install.packages("boot")}

Descriptive statistics

 

Descriptive statistics are used to summarize data in a way that provides insight into the information contained in the data.  This might include examining the mean or median of numeric data or the frequency of observations for nominal data.  Plots can be created that show the data and indicate summary statistics.

 

Choosing which summary statistics are appropriate depends on the type of variable being examined.  Different statistics should be used for interval/ratio, ordinal, and nominal data. 

 

In describing or examining data, you will typically be concerned with measures of location, variation, and shape. 

 

Location is also called central tendency.  It is a measure of where the values of the data are centered.  For example, are the values close to 10 or 100 or 1000?  Measures of location include the mean and median, as well as somewhat more exotic statistics like M-estimators or Winsorized means.

 

Variation is also called dispersion.  It is a measure of how far the data points lie from one another.  Common statistics include standard deviation and coefficient of variation.  For data that aren’t normally-distributed, percentiles or the interquartile range might be used.

 

Shape refers to the distribution of values.  The best tools to evaluate the shape of data are histograms and related plots.  Statistics include skewness and kurtosis, though they are less useful than visual inspection.  We can describe data shape as normally-distributed, log-normal, uniform, skewed, bi-modal, and others.

 

Descriptive statistics for interval/ratio data

 

For this example, imagine that Ren and Stimpy have each held eight workshops educating the public about water conservation at home.  They are interested in how many people showed up to the workshops.

 

Because the data are housed in a data frame, we can use the convention Data$Attendees to access the variable Attendees within the data frame Data.


Input = ("
Instructor  Location  Attendees
Ren         North      7
Ren         North     22
Ren         North      6
Ren         North     15
Ren         South     12
Ren         South     13
Ren         South     14
Ren         South     16
Stimpy      North     18
Stimpy      North     17
Stimpy      North     15
Stimpy      North      9
Stimpy      South     15
Stimpy      South     11
Stimpy      South     19
Stimpy      South     23
")

Data = read.table(textConnection(Input),header=TRUE)

Data           ### Will output data frame called Data

str(Data)      ### Will be discussed later,
               ###  but shows the structure of the data frame

summary(Data)  ### Will be discussed later,
               ###   but summarizes variables in the data frame


Functions sum and length

The sum of a variable can be found with the sum function, and the number of observations can be found with the length function.


sum(Data$Attendees)


232


length(Data$Attendees)


16


Statistics of location for interval/ratio data

 

Mean

The mean is the arithmetic average, and is a common statistic used with interval/ratio data.  It is simply the sum of the values divided by the number of values.  The mean function in R will return the mean.


sum(Data$Attendees) / length(Data$Attendees)


14.5


mean(Data$Attendees)


14.5


Caution should be used when reporting mean values with skewed data, as the mean may not be representative of the center of the data.  For example, imagine a town with 10 families, nine of whom have an income of less than $50,000 per year, but with one family with an income of $2,000,000 per year.  The mean income for families in the town would be $233,300, but this may not be a reasonable way to summarize the income of the town.


Income = c(49000, 44000, 25000, 18000, 32000, 47000, 37000, 45000, 36000, 2000000)

mean(Income)


233300


Median

The median is defined as the value below which 50% of the observations fall.  To find this value manually, you would order the observations, and separate the lowest 50% from the highest 50%.  For data sets with an odd number of observations, the median is the middle value.  For data sets with an even number of observations, the median falls half-way between the two middle values.
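
As a quick check on this definition, the median can be computed manually.  This is just a sketch: with 16 observations, the median is the average of the 8th and 9th sorted values.


Sorted = sort(Data$Attendees)

(Sorted[8] + Sorted[9]) / 2


15

   ###  Matches the result of the median function below.
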

 

The median is a robust statistic in that it is not much affected by extreme values.  For example, if we changed Stimpy’s last Attendees value from 23 to 1000, it would not affect the median.
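
This robustness can be verified directly.  In the following sketch, Extreme is just an illustrative name for a copy of Attendees with that one value changed.


Extreme = Data$Attendees

Extreme[Extreme == 23] = 1000

median(Extreme)


15

   ###  The median is unchanged, though the mean would change considerably.
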


median(Data$Attendees)


15

   ### Note that in this case the mean and median are close in value
   ###   to one another.  The mean and median will be more different
   ###   the more the data are skewed. 


The median is appropriate for either skewed or unskewed data.  The median income for the town discussed above is $40,500.  Half the families in the town have an income above this amount, and half have an income below this amount.

 

Income = c(49000, 44000, 25000, 18000, 32000, 47000, 37000, 45000, 36000, 2000000)

median(Income)


40500


Note that medians are sometimes reported as the “average person” or “typical family”.  Saying, “The average American family earned $54,000 last year” means that the median income for families was $54,000.  The “average family” is that one with the median income.

 

Mode

The mode is a summary statistic that is used rarely in practice, but is normally included in any discussion of means and medians.  When a variable takes on discrete values, the mode is simply the value which occurs most frequently.  For example, in the Statistics Learning Center video in the Required Readings below, Dr. Nic gives an example of counting the number of pairs of shoes each student owns.  The most common answer was 10, and therefore 10 is the mode for that data set.

 

For our Ren and Stimpy example, the value 15 occurs three times and so is the mode.
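
If you prefer not to load a package, the mode of a discrete variable can be sketched with the table function in base R.  Note that this approach returns the value as a character string.


Counts = table(Data$Attendees)

names(Counts)[which.max(Counts)]


"15"
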

 

The Mode function can be found in the package DescTools.


library(DescTools)

Mode(Data$Attendees)


15


Statistics of variation for interval/ratio data

 

Standard deviation

The standard deviation is a measure of variation which is commonly used with interval/ratio data.  It’s a measurement of how close the observations in the data set are to the mean.

 

There’s a handy rule of thumb that—for normally distributed data—68% of data points fall within the mean ± 1 standard deviation, 95% of data points fall within the mean ± 2 standard deviations, and 99.7% of data points fall within the mean ± 3 standard deviations.
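
This rule of thumb can be illustrated by simulation.  The following sketch, which is not part of the workshop example, draws 10,000 normally distributed values and checks what proportion fall within one and two standard deviations of the mean.


set.seed(1)

Normal = rnorm(10000, mean = 100, sd = 15)

mean(abs(Normal - mean(Normal)) <= sd(Normal))       ### Approximately 0.68

mean(abs(Normal - mean(Normal)) <= 2 * sd(Normal))   ### Approximately 0.95
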

 

Because the mean is often represented with the letter mu, and the standard deviation is represented with the letter sigma, saying someone is “a few sigmas away from mu” indicates they are rather a rare character.  (I originally heard this joke on an episode of Car Talk, for which I cannot find a reference or transcript.)


sd(Data$Attendees)


4.830459

 

Standard deviation may not be appropriate for skewed data.

 

Standard error of the mean

Standard error of the mean is a measure that estimates how close a calculated mean is likely to be to the true mean of that population.  It is commonly used in tables or plots where multiple means are presented together.  For example, we might want to present the mean attendees for Ren with the standard error for that mean, and the mean attendees for Stimpy with the standard error for that mean.

 

The standard error is the standard deviation of a data set divided by the square root of the number of observations.  It can also be found in the output for the describe function in the psych package, labelled se.

 

sd(Data$Attendees) / 
   sqrt(length(Data$Attendees))


1.207615

 

library(psych)              

describe(Data$Attendees)

 

  vars  n mean   sd median trimmed  mad min max range  skew kurtosis   se
1    1 16 14.5 4.83     15    14.5 4.45   6  23    17 -0.04    -0.88 1.21

   ###  se indicates the standard error of the mean


Standard error of the mean may not be appropriate for skewed data.

 

Five-number summary, quartiles, percentiles

The median is the same as the 50th percentile, because 50% of values fall below this value.  Other percentiles for a data set can be identified to provide more information.  Typically, the 0th, 25th, 50th, 75th, and 100th percentiles are reported.  This is sometimes called the five-number summary.

 

These values can also be called the minimum, 1st quartile, 2nd quartile, 3rd quartile, and maximum.

 

The five-number summary is a useful measure of variation for skewed interval/ratio data or for ordinal data.  25% of values fall below the 1st quartile and 25% of values fall above the 3rd quartile.  This leaves the middle 50% of values between the 1st and 3rd quartiles, giving a sense of the range of the middle half of the data.  This range is called the interquartile range (IQR).
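
The interquartile range itself can be obtained with the IQR function in base R, which by default uses the same type 7 quantile calculation as the summary function.


IQR(Data$Attendees)


5.5

   ###  17.25 - 11.75
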

 

Percentiles and quartiles are relatively robust, as they aren’t affected much by a few extreme values.  They are appropriate for both skewed and unskewed data.


summary(Data$Attendees)


   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   6.00   11.75   15.00   14.50   17.25   23.00

      ###  The five-number summary and the mean


Optional technical note on calculating percentiles

It may have struck you as odd that the 3rd quartile for Attendees was reported as 17.25.  After all, if you were to order the values of Attendees, the 75th percentile would fall between 17 and 18.  But why does R go with 17.25 and not 17.5?


sort(Data$Attendees)


 6  7  9 11 12 13 14 15 15 15 16 17 18 19 22 23


The answer is that there are several different methods to calculate percentiles, and they may give slightly different answers.  For details on the calculations, see ?quantile.

 

For Attendees, the default type 7 calculation yields a 75th percentile value of 17.25, whereas the type 2 calculation simply splits the difference between 17 and 18 and yields 17.5.  The type 1 calculation doesn’t average the two values, and so just returns 17.

 

quantile(Data$Attendees, 0.75, type=7)


  75%
17.25



quantile(Data$Attendees, 0.75, type=2)


 75%
17.5



quantile(Data$Attendees, 0.75, type=1)


75%
 17
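

For the curious, the type 7 value can be reproduced by hand.  This sketch uses the interpolation position h = (n – 1)p + 1 described in ?quantile.


x = sort(Data$Attendees)

n = length(x)

(n - 1) * 0.75 + 1      ### h = 12.25, so interpolate between x[12] and x[13]

x[12] + 0.25 * (x[13] - x[12])


17.25
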


Percentiles other than the 25th, 50th, and 75th can be calculated with the quantile function.  For example, to calculate the 95th percentile:


quantile(Data$Attendees, .95)


  95%
22.25


Confidence intervals

Confidence intervals are covered in the next chapter.

 

Statistics for grouped interval/ratio data

In many cases, we will want to examine summary statistics for a variable within groups.  For example, we may want to examine statistics for the workshops led by Ren and those led by Stimpy.

 

Summarize in FSA

The Summarize function in the FSA package returns the number of observations, mean, standard deviation, minimum, 1st quartile, median, 3rd quartile, and maximum for grouped data.

 

Note the use of formula notation:  Attendees is the dependent variable (the variable you want to get the statistics for); and Instructor is the independent variable (the grouping variable).  Summarize allows you to summarize over the combination of multiple independent variables by listing them to the right of the ~ separated by a plus sign (+).

 

library(FSA)

Summarize(Attendees ~ Instructor,
          data=Data)


  Instructor n nvalid   mean       sd min    Q1 median    Q3 max percZero
1        Ren 8      8 13.125 5.083236   6 10.75   13.5 15.25  22        0
2     Stimpy 8      8 15.875 4.454131   9 14.00   16.0 18.25  23        0


Summarize(Attendees ~ Instructor + Location,
          data=Data)


  Instructor Location n nvalid  mean       sd min    Q1 median    Q3 max percZero
1        Ren    North 4      4 12.50 7.505554   6  6.75   11.0 16.75  22        0
2     Stimpy    North 4      4 14.75 4.031129   9 13.50   16.0 17.25  18        0
3        Ren    South 4      4 13.75 1.707825  12 12.75   13.5 14.50  16        0
4     Stimpy    South 4      4 17.00 5.163978  11 14.00   17.0 20.00  23        0



summarySE in Rmisc

The summarySE function in the Rmisc package outputs the number of observations, mean, standard deviation, standard error of the mean, and confidence interval for grouped data.  The summarySE function allows you to summarize over the combination of multiple independent variables by listing them as a vector, e.g. c("Instructor", "Location").


library(Rmisc)

summarySE(data=Data,
          "Attendees",
          groupvars="Instructor",
          conf.interval = 0.95)


  Instructor N Attendees       sd       se       ci
1        Ren 8    13.125 5.083236 1.797195 4.249691
2     Stimpy 8    15.875 4.454131 1.574773 3.723747


summarySE(data=Data,
          "Attendees",
          groupvars = c("Instructor", "Location"),
          conf.interval = 0.95)



  Instructor Location N Attendees       sd        se        ci
1        Ren    North 4     12.50 7.505553 3.7527767 11.943011
2        Ren    South 4     13.75 1.707825 0.8539126  2.717531
3     Stimpy    North 4     14.75 4.031129 2.0155644  6.414426
4     Stimpy    South 4     17.00 5.163978 2.5819889  8.217041



describeBy in psych

The describeBy function in the psych package returns the number of observations, mean, median, trimmed means, minimum, maximum, range, skew, kurtosis, and standard error of the mean for grouped data.  describeBy allows you to summarize over the combination of multiple independent variables by combining terms with a colon (:).

 


library(psych)

describeBy(Data$Attendees,
           group = Data$Instructor,
           digits= 4)


group: Ren
  vars n  mean   sd median trimmed  mad min max range skew kurtosis  se
1    1 8 13.12 5.08   13.5   13.12 2.97   6  22    16 0.13    -1.08 1.8
-------------------------------------------------------------------------
group: Stimpy
  vars n  mean   sd median trimmed  mad min max range  skew kurtosis   se
1    1 8 15.88 4.45     16   15.88 3.71   9  23    14 -0.06    -1.26 1.57


describeBy(Data$Attendees,
           group = Data$Instructor : Data$Location,
           digits= 4)


group: Ren:North
  vars n mean   sd median trimmed  mad min max range skew kurtosis   se
1    1 4 12.5 7.51     11    12.5 6.67   6  22    16 0.26    -2.14 3.75
-------------------------------------------------------------------------
group: Ren:South
  vars n  mean   sd median trimmed  mad min max range skew kurtosis   se
1    1 4 13.75 1.71   13.5   13.75 1.48  12  16     4 0.28    -1.96 0.85
-------------------------------------------------------------------------
group: Stimpy:North
  vars n  mean   sd median trimmed  mad min max range  skew kurtosis   se
1    1 4 14.75 4.03     16   14.75 2.22   9  18     9 -0.55    -1.84 2.02
-------------------------------------------------------------------------
group: Stimpy:South
  vars n mean   sd median trimmed  mad min max range skew kurtosis   se
1    1 4   17 5.16     17      17 5.93  11  23    12    0    -2.08 2.58


Summaries for data frames

Often we will want to summarize variables in a whole data frame, either to get summary statistics for individual variables, or to check that the variables have the values we expected when the data were input, to be sure there wasn’t some error.

 

The str function

The str function in the native utils package will list variables for a data frame, with their types and levels.  Data is the name of the data frame, created above.

 

str(Data)

 

'data.frame':     16 obs. of  3 variables:
 $ Instructor: Factor w/ 2 levels "Ren","Stimpy": 1 1 1 1 1 1 1 1 2 2 ...
 $ Location  : Factor w/ 2 levels "North","South": 1 1 1 1 2 2 2 2 1 1 ...
 $ Attendees : int  7 22 6 15 12 13 14 16 18 17 ...

   ### Instructor is a factor (nominal) variable with two levels.
   ### Location is a factor (nominal) variable with two levels.
   ### Attendees is an integer variable.



The summary function

The summary function in the native base package will summarize all variables in a data frame, listing the frequencies for levels of nominal variables; and for ratio/interval data, the minimum, 1st quartile, median, mean, 3rd quartile, and maximum.

 

summary(Data)


  Instructor  Location   Attendees   
 Ren   :8    North:8   Min.   : 6.00 
 Stimpy:8    South:8   1st Qu.:11.75 
                       Median :15.00 
                       Mean   :14.50 
                       3rd Qu.:17.25 
                       Max.   :23.00


The headTail function in psych

The headTail function in the psych package reports the first and last observations for a data frame.


library(psych)

headTail(Data)


    Instructor Location Attendees
1          Ren    North         7
2          Ren    North        22
3          Ren    North         6
4          Ren    North        15
...       <NA>     <NA>       ...
13      Stimpy    South        15
14      Stimpy    South        11
15      Stimpy    South        19
16      Stimpy    South        23


The describe function in psych

The describe function in the psych package reports number of observations, mean, standard deviation, trimmed means, minimum, maximum, range, skew, kurtosis, and standard error for variables in a data frame. 

 

Note that factor variables are labeled with an asterisk (*), and the levels of the factors are coded as 1, 2, 3, etc.

 

library(psych)

describe(Data)


            vars  n mean   sd median trimmed  mad min max range  skew kurtosis   se
Instructor*    1 16  1.5 0.52    1.5     1.5 0.74   1   2     1  0.00    -2.12 0.13
Location*      2 16  1.5 0.52    1.5     1.5 0.74   1   2     1  0.00    -2.12 0.13
Attendees      3 16 14.5 4.83   15.0    14.5 4.45   6  23    17 -0.04    -0.88 1.21


Dealing with missing values

Sometimes a data set will have missing values.  This can occur for a variety of reasons, such as a respondent not answering a specific question, or a researcher being unable to make a measurement due to temporarily malfunctioning equipment.

 

In R, a missing value is indicated with NA.

 

By default, different functions in R will handle missing values in different ways.  But most have options to change how they treat missing data.

 

In general, you should scan your data for missing data, and think carefully about the best way to handle observations with missing values.


Input = ("
Instructor  Location  Attendees
Ren         North      7
Ren         North     22
Ren         North      6
Ren         North     15
Ren         South     12
Ren         South     13
Ren         South     NA
Ren         South     16
Stimpy      North     18
Stimpy      North     17
Stimpy      North     NA
Stimpy      North      9
Stimpy      South     15
Stimpy      South     11
Stimpy      South     19
Stimpy      South     23
")

Data2 = read.table(textConnection(Input),header=TRUE)

   ###  Note: This data frame will be called Data2 to distinguish it
   ###          from Data above.


Data2


The na.rm option for missing values with a simple function

Many common functions in R have a na.rm option.  If this option is set to FALSE, the function will return an NA result if there are any NA’s in the data values passed to the function.  If set to TRUE, observations with NA values will be discarded, and the function will continue to calculate its output.

 

Note that the na.rm option operates only on the data values actually passed to the function.  In the following example with median, only Attendees is passed to the function; if there were NA’s in other variables, this would not affect the function.

 

Not all functions have the same default for the na.rm option.  To determine the default, use e.g. ?median, ?mean, ?sd.


median(Data2$Attendees,
       na.rm = FALSE)


NA

   ###  na.rm=FALSE. Since there is an NA in the data, report NA.


median(Data2$Attendees,
       na.rm = TRUE)


15

   ###  na.rm=TRUE. Drop observations with NA and then calculate the median.


Missing values with Summarize in FSA

Summarize in FSA will indicate invalid values, including NA’s, with the count of valid observations reported as the variable nvalid.


library(FSA)

Summarize(Attendees ~ Instructor,
          data=Data2)


  Instructor n nvalid mean       sd min   Q1 median   Q3 max percZero
1        Ren 8      7   13 5.477226   6  9.5     13 15.5  22        0
2     Stimpy 8      7   16 4.795832   9 13.0     17 18.5  23        0

   ###  This function removes missing values, but indicates the number of
   ###    missing values by not including them in the count for nvalid.


Missing values indicated with summary function

The summary function counts NA’s for numeric variables.


summary(Data2)


  Instructor  Location   Attendees   
 Ren   :8    North:8   Min.   : 6.00 
 Stimpy:8    South:8   1st Qu.:11.25 
                       Median :15.00 
                       Mean   :14.50 
                       3rd Qu.:17.75 
                       Max.   :23.00 
                       NA's   :2

   ###  Indicates two NA’s in Attendees.


Missing values in the describe function in psych

The describe function in the psych package removes NA’s by default.


library(psych)

describe(Data2$Attendees)


  vars  n mean   sd median trimmed  mad min max range  skew kurtosis   se
1    1 14 14.5 5.19     15    14.5 5.19   6  23    17 -0.04    -1.17 1.39

   ###  Note that two NA’s were removed by default, reporting an n of 14.


Missing values in the summarySE function in Rmisc

By default, the summarySE function does not remove NA’s, but can be made to do so with the na.rm=TRUE option.


library(Rmisc)

summarySE(data=Data2,
          "Attendees")


   .id  N Attendees sd se ci
1 <NA> 16        NA NA NA NA

   ###  Note an N of 16 is reported, and statistics are reported as NA.


library(Rmisc)

summarySE(data=Data2,
          "Attendees",
          na.rm=TRUE)


   .id  N Attendees       sd      se       ci
1 <NA> 14      14.5 5.185038 1.38576 2.993752

   ###  Note an N of 14 is reported, and statistics are calculated with
   ###    NA’s removed.



Advanced techniques

 

Discarding subjects

Certain psychology studies use scales that are calculated from responses of several questions.  Because answers to all questions are needed to reliably calculate the scale, any observation (subject, person) with missing answers will simply be discarded.  Some types of analyses in other fields also follow this approach.

 

Subjects can be deleted with the subset function.  The following code creates a new data frame, Data3, with all observations with NA in the variable Attendees removed from Data2.

 

Data3 = subset(Data2,
               !is.na(Attendees))

Data3
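
The complete.cases and na.omit functions in base R offer an alternative sketch, though note that these remove rows with an NA in any variable, not just Attendees.  Data4 here is just an illustrative name.


Data4 = Data2[complete.cases(Data2), ]

Data4 = na.omit(Data2)     ### Equivalent result for this data frame
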


Imputation of values

Missing values can be assigned a likely value through the process of imputation.  An algorithm determines a likely value based on the values of the other variables for that observation and on the relationships among variables across the rest of the observations.  The mice package in R can perform imputation.
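
A minimal sketch with mice might look like the following.  The object names here are illustrative, and the settings shown are simply package defaults, not a recommendation for any particular analysis.


if(!require(mice)){install.packages("mice")}

library(mice)

Imputed = mice(Data2, m = 5, seed = 1)   ### Produce five imputed data sets

Data.imputed = complete(Imputed)         ### Extract the first completed data set
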

 

Optional code: removing missing values in vectors

If functions don’t have a na.rm option, it is possible to remove NA observations manually.  Here we’ll create a vector called valid that just contains those values of Attendees that are not NA’s. 

 

The ! operator is a logical not.  The brackets serve as a list of observations to include for the preceding variable.  So, the code essentially says, “Define a vector valid as the values of Attendees where the values of Attendees are not NA.”

 

valid = Data2$Attendees[!is.na(Data2$Attendees)]

summary(valid)


   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   6.00   11.25   15.00   14.50   17.75   23.00


Statistics of shape for interval/ratio data

The most common statistics for shape are skewness and kurtosis. 

 

Understanding skewness and kurtosis is important, as they are ways in which a distribution of data varies from a normal distribution.  This will be important in assessing the assumptions of certain statistical tests. 

 

However, I rarely see skewness and kurtosis values reported.  Instead, normality is usually assessed visually with plots, or using certain statistical tests.  One problem with using skewness and kurtosis values is that there is not agreement on what values constitute meaningful deviations from the normal curve.

 

Skewness

Skewness indicates the degree of asymmetry in a data set.  If there are relatively more values that are far greater than the mean, the distribution is positively skewed or right skewed, with a tail stretching to the right.  Negative or left skew is the opposite.

 

A symmetric distribution has a skewness of 0.  The skewness value is positive for a positively skewed distribution, and negative for a negatively skewed distribution.  Sometimes a skew with an absolute value greater than 1 or 1.5 or 2 is considered highly skewed.  There is not agreement on this interpretation.
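
The sign convention can be demonstrated with simulated data.  In this sketch, rlnorm generates right-skewed log-normal values, and skew is from the psych package.


library(psych)

set.seed(1)

skew(rlnorm(1000))     ### Positive value: right skew

skew(-rlnorm(1000))    ### Negative value: left skew
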

 

There are different methods to calculate skew and kurtosis.  The describe function in the psych package has three options for how to calculate them.

 

library(psych)              

describe(Data$Attendees,
         type=3)         ### Type of calculation for skewness and kurtosis

 

  vars  n mean   sd median trimmed  mad min max range  skew kurtosis   se
1    1 16 14.5 4.83     15    14.5 4.45   6  23    17 -0.04    -0.88 1.21

   ### Skewness and kurtosis among other statistics



[Figures: a symmetric normal distribution; a positively-skewed distribution; a negatively-skewed distribution]


The normal curve is symmetrical around its center.  The positively-skewed distribution has a longer, thicker tail to the right.  The negatively-skewed distribution has a longer, thicker tail to the left.

For more information on the normal curve, see the video on “Normal Distribution” from Statistics Learning Center in the “Optional Readings” section.  Additional thoughts on the normal distribution and real-world data distributions are in the article by Dr. Nic in the “Optional Readings” section.

 

Kurtosis

 

Kurtosis measures the degree to which the distribution is either more sharply peaked or more squat relative to a normal distribution.  In general, the higher the kurtosis, the sharper the peak and the longer the tails.  Such a distribution is called leptokurtic, and is indicated by positive kurtosis values.  The opposite, called platykurtic, is indicated by negative kurtosis values.  Sometimes a kurtosis with an absolute value greater than 2 or 3 is considered a high deviation from being mesokurtic.  There is not agreement on this interpretation.
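
As a sketch of this sign convention, the kurtosi function in the psych package can be applied to simulated data.  A heavy-tailed t distribution is typically leptokurtic, and a uniform distribution platykurtic.


library(psych)

set.seed(1)

kurtosi(rt(1000, df = 5))   ### Typically positive (leptokurtic)

kurtosi(runif(1000))        ### Typically negative (platykurtic)
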

 

Descriptive statistics for ordinal data

 

Descriptive statistics for ordinal data are more limited than those for interval/ratio data.  You’ll remember that for ordinal data, the levels can be ordered, but we can’t say that the intervals between the levels are equal.  For example, we can’t say that an Associate’s degree and Master’s degree somehow average out to a Bachelor’s degree.  This concept is discussed in more detail in the chapters on Likert data.

 

Because of this fact, several common descriptive statistics are usually inappropriate for use with ordinal data.  These include mean, standard deviation, and standard error of the mean.

 

Ordinal data can be described by either: 1) treating the data as numeric and using appropriate statistics such as median and quartiles; or, 2) treating the data as nominal, and looking at counts of the data for each level.

 

A more-complete discussion of descriptive statistics for ordinal data can be found in the Descriptive Statistics for Likert Data chapter.

 

Example of descriptive statistics for ordinal data

For this example, imagine that Arthur and Baxter have each held a workshop educating the public about preventing water pollution at home.  They are interested in the education level of people who attended the workshops.  We can consider education level an ordinal variable.  They also collected information on the sex and county of participants.

 

For this data, we have manually coded Education with numbers in a separate variable Ed.code.  These numbers list the education in order:  High school  <  Associate’s  <  Bachelor’s  <  Master’s  <  Ph.D.

 

Of course we could have had R do this coding for us, but the code is slightly messy, so I did the coding manually, and listed the optional R code below.

 

Ideally we would want to treat Education as an ordered factor variable in R.  But unfortunately most common functions in R won’t handle ordered factors well.  Later in this book, cumulative link models (CLM) will use ordered factor data directly.  Optional R code is shown below for converting a factor to an ordered factor. 

 

Input = ("
Date          Instructor       Student  Sex     County       Education  Ed.code
'2015-11-01'  'Arthur Read'    a        female  'Elwood'     BA         3
'2015-11-01'  'Arthur Read'    b        female  'Bear Lake'  PHD        5
'2015-11-01'  'Arthur Read'    c        male    'Elwood'     BA         3
'2015-11-01'  'Arthur Read'    d        female  'Elwood'     MA         4
'2015-11-01'  'Arthur Read'    e        male    'Elwood'     HS         1
'2015-11-01'  'Arthur Read'    f        female  'Bear Lake'  MA         4
'2015-11-01'  'Arthur Read'    g        male    'Elwood'     HS         1
'2015-11-01'  'Arthur Read'    h        female  'Elwood'     BA         3
'2015-11-01'  'Arthur Read'    i        female  'Elwood'     BA         3
'2015-11-01'  'Arthur Read'    j        female  'Elwood'     BA         3
'2015-12-01'  'Buster Baxter'  k        male    'Elwood'     MA         4
'2015-12-01'  'Buster Baxter'  l        male    'Bear Lake'  MA         4
'2015-12-01'  'Buster Baxter'  m        female  'Elwood'     AA         2
'2015-12-01'  'Buster Baxter'  n        male    'Elwood'     AA         2
'2015-12-01'  'Buster Baxter'  o        other   'Elwood'     BA         3
'2015-12-01'  'Buster Baxter'  p        female  'Elwood'     BA         3
'2015-12-01'  'Buster Baxter'  q        female  'Bear Lake'  PHD        5
")

Data = read.table(textConnection(Input),header=TRUE)


###  Check the data frame

Data

str(Data)


'data.frame':     17 obs. of  7 variables:

 $ Date      : Factor w/ 2 levels "2015-11-01","2015-12-01": 1 1 1 1 1 1 1 1 1 1 ...
 $ Instructor: Factor w/ 2 levels "Arthur Read",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ Student   : Factor w/ 17 levels "a","b","c","d",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ Sex       : Factor w/ 3 levels "female","male",..: 1 1 2 1 2 1 2 1 1 1 ...
 $ County    : Factor w/ 2 levels "Bear Lake","Elwood": 2 1 2 2 2 1 2 2 2 2 ...
 $ Education : Factor w/ 5 levels "AA","BA","HS",..: 2 5 2 4 3 4 3 2 2 2 ...
 $ Ed.code   : int  3 5 3 4 1 4 1 3 3 3 ...



summary(Data)


         Date            Instructor    Student       Sex           County 
 2015-11-01:10   Arthur Read  :10   a      : 1   female:10   Bear Lake: 4 
 2015-12-01: 7   Buster Baxter: 7   b      : 1   male  : 6   Elwood   :13 
                                    c      : 1   other : 1                
                                    d      : 1                            
                                    e      : 1                            
                                    f      : 1                            
                                    (Other):11                            
 Education    Ed.code    
 AA :2     Min.   :1.000  
 BA :7     1st Qu.:3.000 
 HS :2     Median :3.000 
 MA :4     Mean   :3.118 
 PHD:2     3rd Qu.:4.000 
           Max.   :5.000



### Remove unnecessary objects

rm(Input)


Optional code to assign values to a variable based on another variable


Data$Ed.code[Data$Education=="HS"] = 1
Data$Ed.code[Data$Education=="AA"] = 2
Data$Ed.code[Data$Education=="BA"] = 3
Data$Ed.code[Data$Education=="MA"] = 4
Data$Ed.code[Data$Education=="PHD"] = 5



Optional code to change a factor variable to an ordered factor variable


Data$Education.ordered = factor(Data$Education,
                                ordered = TRUE,
                                levels = c("HS", "AA", "BA", "MA", "PHD")
                                )
str(Data$Education.ordered)

summary(Data$Education.ordered)  


 HS  AA  BA  MA PHD
  2   2   7   4   2


median(as.numeric(Data$Education.ordered))

   ###  The median function doesn't accept factors directly,
   ###    so we use the integer codes of the ordered factor


[1] 3


levels(Data$Education.ordered)[3]


[1] "BA"


Statistics of location for ordinal data

Using the mean is usually not appropriate for ordinal data.  Instead, the median should be used as a measure of location.  We will use our numerically coded variable Ed.code for the analysis.

 

median(Data$Ed.code)


[1] 3

   ### Remember that 3 meant Bachelor’s degree in our coding


Statistics of variation for ordinal data

Statistics like the standard deviation and standard error of the mean are usually inappropriate for ordinal data.  Instead, the five-number summary can be used.


summary(Data$Ed.code)


   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  1.000   3.000   3.000   3.118   4.000   5.000

   ### Remember to ignore the mean value for ordinal data.

   ### Remember that 1 meant High school, 3 meant Bachelor’s,
   ###   4 meant Master’s, and 5 meant Ph.D.


Statistics for grouped ordinal data

Because the Summarize function in the FSA package reports the five-number summary, it is a useful function for descriptive statistics for grouped ordinal data.


library(FSA)

Summarize(Ed.code ~ Sex,
          data=Data)


     Sex  n nvalid mean        sd min   Q1 median   Q3 max percZero
1 female 10     10  3.5 0.9718253   2 3.00    3.0 4.00   5        0
2   male  6      6  2.5 1.3784049   1 1.25    2.5 3.75   4        0
3  other  1      1  3.0        NA   3 3.00    3.0 3.00   3        0

   ###  Remember to ignore the mean and sd values for ordinal data.



library(FSA)

Summarize(Ed.code ~ Sex + County,
          data=Data)


     Sex    County n nvalid     mean        sd min  Q1 median Q3 max percZero
1 female Bear Lake 3      3 4.666667 0.5773503   4 4.5      5  5   5        0
2   male Bear Lake 1      1 4.000000        NA   4 4.0      4  4   4        0
3 female    Elwood 7      7 3.000000 0.5773503   2 3.0      3  3   4        0
4   male    Elwood 5      5 2.200000 1.3038405   1 1.0      2  3   4        0
5  other    Elwood 1      1 3.000000        NA   3 3.0      3  3   3        0

   ###  Remember to ignore the mean and sd values for ordinal data.


Statistics for ordinal data treated as nominal data

The summary function in the native base package and the xtabs function in the native stats package provide counts for levels of a nominal variable.

 

The prop.table function translates a table into proportions.  The margin=1 option indicates that the proportions are calculated for each row.

 

First we will order the levels of Education; otherwise, R will report results in alphabetical order.


Data$Education = factor(Data$Education,
                        levels = c("HS", "AA", "BA", "MA", "PHD"))

   ### Order factors, otherwise R will alphabetize them


summary(Data$Education)


HS  AA  BA  MA PHD
 2   2   7   4   2

  
### Counts of each level of Education


XT = xtabs( ~ Education,
            data=Data)

XT


Education
 HS  AA  BA  MA PHD
  2   2   7   4   2


prop.table(XT)


Education
       HS        AA        BA        MA       PHD
0.1176471 0.1176471 0.4117647 0.2352941 0.1176471


Grouped data


XT = xtabs( ~ Sex + Education,
            data = Data)

XT


        Education
Sex      HS AA BA MA PHD
  female  0  1  5  2   2
  male    2  1  1  2   0
  other   0  0  1  0   0


prop.table(XT,
           margin = 1)


        Education
Sex             HS        AA        BA        MA       PHD
  female 0.0000000 0.1000000 0.5000000 0.2000000 0.2000000
  male   0.3333333 0.1666667 0.1666667 0.3333333 0.0000000
  other  0.0000000 0.0000000 1.0000000 0.0000000 0.0000000

   ### Proportion of responses for each row
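As an aside not shown above, margin = 2 computes proportions down each column instead of across each row.  A minimal sketch, using a small hypothetical 2 x 2 table invented for illustration:

```r
### Hypothetical 2 x 2 table of counts, invented for illustration
XT2 = matrix(c(0, 5,
               2, 1),
             nrow = 2, byrow = TRUE,
             dimnames = list(Sex = c("female", "male"),
                             Education = c("HS", "BA")))

prop.table(XT2, margin = 1)   ### Proportions within each row; rows sum to 1
prop.table(XT2, margin = 2)   ### Proportions within each column; columns sum to 1
```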


Two-way grouped data


XT = xtabs(~ Sex + Education + County,
           data=Data)

XT

   ###  Note that the dependent variable, Education, is in the middle
   ###    of the variable list.


, , County = Bear Lake

        Education
Sex      HS AA BA MA PHD
  female  0  0  0  1   2
  male    0  0  0  1   0
  other   0  0  0  0   0

, , County = Elwood

        Education
Sex      HS AA BA MA PHD
  female  0  1  5  1   0
  male    2  1  1  1   0
  other   0  0  1  0   0

Descriptive statistics for nominal data

 

Descriptive statistics for nominal data consist of listing or plotting counts for levels of the nominal data, often by levels of a grouping variable.  Tables of this information are often called contingency tables.

 

For this example, we will look again at Arthur and Buster’s data, but this time considering Sex to be the dependent variable of interest.

 

If levels of a nominal variable are coded with numbers, remember that the numbers are arbitrary.  For example, if you assign female = 1, male = 2, and other = 3, it makes no sense to say that the average sex in Arthur’s class was 1.3, or that the average sex in Buster’s class was greater than that in Arthur’s.  It also makes no sense to say that the median sex was female.
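To make this concrete, here is a quick sketch using the hypothetical coding above applied to Arthur’s class (the variable name Sex.code is invented for illustration):

```r
### Hypothetical coding of Sex for Arthur's 10 students: female = 1, male = 2
Sex.code = c(1, 1, 2, 1, 2, 1, 2, 1, 1, 1)

mean(Sex.code)   ### 1.3, but an "average sex" of 1.3 has no meaningful interpretation
```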


Input = ("
Date          Instructor       Student  Sex     County       Education  Ed.code
'2015-11-01'  'Arthur Read'    a        female  'Elwood'     BA         3
'2015-11-01'  'Arthur Read'    b        female  'Bear Lake'  PHD        5
'2015-11-01'  'Arthur Read'    c        male    'Elwood'     BA         3
'2015-11-01'  'Arthur Read'    d        female  'Elwood'     MA         4
'2015-11-01'  'Arthur Read'    e        male    'Elwood'     HS         1
'2015-11-01'  'Arthur Read'    f        female  'Bear Lake'  MA         4
'2015-11-01'  'Arthur Read'    g        male    'Elwood'     HS         1
'2015-11-01'  'Arthur Read'    h        female  'Elwood'     BA         3
'2015-11-01'  'Arthur Read'    i        female  'Elwood'     BA         3
'2015-11-01'  'Arthur Read'    j        female  'Elwood'     BA         3
'2015-12-01'  'Buster Baxter'  k        male    'Elwood'     MA         4
'2015-12-01'  'Buster Baxter'  l        male    'Bear Lake'  MA         4
'2015-12-01'  'Buster Baxter'  m        female  'Elwood'     AA         2
'2015-12-01'  'Buster Baxter'  n        male    'Elwood'     AA         2
'2015-12-01'  'Buster Baxter'  o        other   'Elwood'     BA         3
'2015-12-01'  'Buster Baxter'  p        female  'Elwood'     BA         3
'2015-12-01'  'Buster Baxter'  q        female  'Bear Lake'  PHD        5
")

Data = read.table(textConnection(Input),header=TRUE)


###  Check the data frame


Data

str(Data)

summary(Data)


### Remove unnecessary objects

rm(Input)


Example of descriptive statistics for nominal data

The summary function in the native base package and the xtabs function in the native stats package provide counts for levels of a nominal variable.

 

The prop.table function translates a table into proportions.  The margin=1 option indicates that the proportions are calculated for each row.

 
One-sample data


summary(Data$Sex)


female   male  other
    10      6      1

   ### Counts of each level of Sex


One-way data


xtabs(~ Date + Sex,
      data=Data)


            Sex
Date         female male other
  2015-11-01      7    3     0
  2015-12-01      3    3     1


XT = xtabs(~ Date + Sex,
           data=Data)

prop.table(XT,
           margin = 1)


            Sex
Date            female      male     other
  2015-11-01 0.7000000 0.3000000 0.0000000
  2015-12-01 0.4285714 0.4285714 0.1428571

   ### Proportion of each level of Sex for each row


sum(XT)


[1] 17

   ### Sum of observations in the table


rowSums(XT)


2015-11-01 2015-12-01
        10          7

   ### Sum of observations in each row of the table

colSums(XT)


female   male  other
    10      6      1

   ### Sum of observations in each column of the table


Two-way data


xtabs(~ County + Sex + Date,
      data=Data)

   ###  Note that the dependent variable, Sex, is in the middle
   ###    of the variable list.


, , Date = 2015-11-01

           Sex
County      female male other
  Bear Lake      2    0     0
  Elwood         5    3     0

, , Date = 2015-12-01

           Sex
County      female male other
  Bear Lake      1    1     0
  Elwood         2    2     1



Required readings

 

[Video]  “Understanding Summary statistics: Mean, Median, Mode” from Statistics Learning Center (Dr. Nic). 2015. www.youtube.com/watch?v=rAN6DBctgJ0.

 

Optional readings

 

Statistics of location:  mean and median

 

“Statistics of central tendency” in McDonald, J.H. 2014. Handbook of Biological Statistics. www.biostathandbook.com/central.html.

 

"The median outclasses the mean" from Dr. Nic. 2013. Learn and Teach Statistics & Operations Research. learnandteachstatistics.wordpress.com/2013/04/29/median/.

 

“Measures of the Location of the Data”, Section 2.3 in Openstax. 2013. Introductory Statistics. openstaxcollege.org/textbooks/introductory-statistics.

 

“Measures of the Center of the Data”, Section 2.5 in Openstax. 2013. Introductory Statistics. openstaxcollege.org/textbooks/introductory-statistics.

 

“Skewness and the Mean, Median, and Mode”, Section 2.6 in Openstax. 2013. Introductory Statistics. openstaxcollege.org/textbooks/introductory-statistics.

 

Statistics of variation:  standard deviation and range

 

“Statistics of dispersion” in McDonald, J.H. 2014. Handbook of Biological Statistics. www.biostathandbook.com/dispersion.html.

 

“Standard error of the mean” in McDonald, J.H. 2014. Handbook of Biological Statistics. www.biostathandbook.com/standarderror.html.

 

“Measures of the Spread of the Data”, Section 2.7 in Openstax. 2013. Introductory Statistics. openstaxcollege.org/textbooks/introductory-statistics.

 

The normal distribution

 

[Video]  “Normal Distribution” from Statistics Learning Center (Dr. Nic). 2016. www.youtube.com/watch?v=mtH1fmUVkfE.

 

"The normal distribution – three tricky bits" from Dr. Nic. 2016. Learn and Teach Statistics & Operations Research. learnandteachstatistics.wordpress.com/2016/01/18/the-normal-distribution/.

 

Optional note on reporting summary statistics honestly

 

Every time we report a descriptive statistic or the result of a statistical test, we are condensing information, which may have been a whole data set, into one or a few pieces of information.  It is the job of the analyst to choose the best way to present descriptive and graphical information, and to choose the correct statistical test or method.

 

SAT score example

Imagine a set of changes in SAT scores from seven students who took a certain SAT preparation course.  The change in scores, Increase, represents the difference in score, that is the score after taking the course minus their initial score.

 

Increase = c(50, 60, 120, -80, -10, 10, 0)


Our first instinct in assessing the course might be to look at the average (mean) change in score.  The result is a mean increase of about 21 points, which might be considered a success.


mean(Increase)

[1] 21.42857


But we would be remiss not to look at other summary statistics and plots.

 

Using the Summarize function in the FSA package, we find that the median increase was only 10 points.  The first quartile was negative, indicating that at least 25% of students scored lower after the course.


library(FSA)

Summarize(Increase,
          digits = 2)


       n   nvalid     mean       sd      min       Q1   median       Q3      max
    7.00     7.00    21.43    63.09   -80.00    -5.00    10.00    55.00   120.00


A histogram of the changes in scores suggests a slight right skew to the data, but that the mass of the data sits not far from zero.


hist(Increase,
     col="gray")


[Histogram of Increase]

 

Finally, we’ll compute the 95% confidence interval for the mean change in score.  Looking at the results for the percentile method, the confidence interval includes zero, suggesting that the change in scores for this course was not different from zero.


library(boot)

Mboot = boot(Increase,
             function(x,i) mean(x[i]),
             R=10000)


boot.ci(Mboot,
        conf = 0.95,
        type = c("perc")
        )


BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS

Level     Percentile    
95%   (-21.43,  64.29 ) 


Based on the data exploration, it seems that it would be irresponsible or dishonest to simply report that the average increase in test scores was 21 points.

 

For descriptive statistics, we might report the mean and its 95% confidence interval, or perhaps the five-number summary for the change in scores, or show the histogram of values.

 

Optional analyses

 

Robust estimators:  trimmed mean and Winsorized mean

 

Robust estimators of central tendency are used to describe the location of a data set without undue influence of extreme values.  Robust estimators include trimmed means, Winsorized means, and other estimates like M-estimators (not discussed here).

 

Example

The New Jersey Envirothon is a statewide competition for high school students covering various aspects of natural resources science.  In the Team Presentation station, student teams are judged by five judges. 

 

Currently, scorekeepers drop the highest and lowest scores for each team to avoid the effect of aberrant low or high scores.  This is an example of a trimmed mean.  In this case, the mean is trimmed to 60% (with 20% removed from each side).
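The scorekeepers’ procedure can be sketched directly: with five scores, dropping the single highest and lowest values matches a 20% trimmed mean.  The scores here are hypothetical:

```r
### Five hypothetical judge scores
Scores = c(100, 90, 80, 60, 20)

mean(sort(Scores)[2:4])      ### Mean of the middle three scores
mean(Scores, trim = 0.20)    ### Built-in trimmed mean; same result
```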

 

Another option to ameliorate the effect of extreme scores is using Winsorized means.  A Winsorized mean doesn’t remove extreme observations entirely, but replaces them with the closest remaining observations in magnitude.
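A Winsorized mean can likewise be sketched by hand for five hypothetical scores: the lowest and highest values are replaced by their nearest neighbors.  (The winsor functions in the psych package use quantiles instead, so their exact values can differ from this simple sketch.)

```r
### Five hypothetical judge scores, sorted from lowest to highest
Scores = sort(c(100, 90, 80, 60, 20))

Scores[1] = Scores[2]   ### Replace the minimum with the next-lowest score
Scores[5] = Scores[4]   ### Replace the maximum with the next-highest score

mean(Scores)            ### Simple Winsorized mean
```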

 

In this example, two teams received five scores for their presentations that have the same mean and median.  We will look at robust estimators to determine if one team should be scored higher than the other.


Team.A = c(100, 90, 80, 60, 20)

median(Team.A)


[1] 80


mean(Team.A)


[1] 70


mean(Team.A,
     trim = .20)            # This trims to the inner 60% of observations


[1] 76.66667


library(psych)

winsor(Team.A,
       trim = 0.20)         # This Winsorizes to the inner 60% of observations


[1] 92 90 80 60 52

    ### Note that the Winsorized values at the extremes appear to be calculated
    ###   with a function analogous to the quantile(x, probs, type = 7) function.



winsor.mean(Team.A,
            trim = 0.20)    # This Winsorizes to the inner 60% of observations


[1] 74.8


Team.B = c(80, 80, 80, 80, 30)

median(Team.B)


[1] 80


mean(Team.B)


[1] 70


mean(Team.B,
     trim = .20)            # This trims to the inner 60% of observations


[1] 80


library(psych)

winsor(Team.B,
       trim = 0.20)         # This Winsorizes to the inner 60% of observations


[1] 80 80 80 80 70

    ### Note that the Winsorized values at the extremes appear to be calculated
    ###   with a function analogous to the quantile(x, probs, type = 7) function.



winsor.mean(Team.B,
            trim = 0.20)    # This Winsorizes to the inner 60% of observations


[1] 78


In this example, the means and medians for Team A and Team B were identical.  However, the trimmed mean for Team A was less than that for Team B (77 vs. 80), and the Winsorized mean for Team A was less than that for Team B (75 vs. 78).  By either method, Team B had the higher score.

 

Note also that for both teams the Winsorized mean was lower than the trimmed mean.  This is because the Winsorized mean is better able to take into account the low scores for each team.

 

Geometric mean

 

The geometric mean is used to summarize certain measurements, such as average return for investments, and for certain scientific measurements, such as bacteria counts in environmental water.  It is useful when data are log-normally distributed.

 

Practically speaking, using the geometric mean ameliorates the effect of outlying values in the data.

 

To get the geometric mean, the log of each value is taken, these values are averaged, and then the result is the base of the log raised to this value.

 

Imagine a series of 9 counts of bacteria from lake water samples, called Bacteria here.  The geometric mean can be calculated with nested log, mean, and exp functions.  Or more simply, the geometric.mean function in the psych package can be used.

 

Bacteria example


Bacteria = c(20, 40, 50, 60, 100, 120, 150, 200, 1000)

exp(mean(log(Bacteria)))


[1] 98.38887


library(psych)

geometric.mean(Bacteria)


[1] 98.38887


hist(Bacteria,       ### Produce a histogram of values
     col="darkgray",
     breaks="FD")


Annual return example

Geometric means are also used to calculate the average annual return for investments.  Each value in this vector represents the annual return of the investment, expressed as a proportion (for example, 0.20 is a 20% gain).  The arithmetic mean will not give the correct answer for average annual return.


Return = c(-0.80, 0.20, 0.30, 0.30, 0.20, 0.30)

library(psych)

geometric.mean(Return + 1)-1


[1] -0.07344666

   ### The geometric mean will give the correct answer for average
   ###  annual return

 

This can also be calculated manually.  If you start with 100 dollars in this investment, you will end up with about 63 dollars, equivalent to an average annual return of –0.07.


100 * (1-0.80) * 1.20 * 1.30 * 1.30 * 1.20 * 1.30


[1] 63.2736


Harmonic mean

 

The harmonic mean is another type of average that is used in certain situations, such as averaging rates or speeds.

 

To get the harmonic mean, the inverse of each value is taken, these values are averaged, and then the inverse of this result is reported.

 

Imagine a series of 9 speeds, called Speed here.  The harmonic mean can be calculated with nested 1/x, mean, and 1/x functions.  Or more simply, the harmonic.mean function in the psych package can be used.


Speed = c(20, 40, 50, 60, 100, 120, 150, 200, 1000)

1/mean(1/Speed)


[1] 63.08411


library(psych)

harmonic.mean(Speed)


[1] 63.08411


Exercises C

 

1. Considering Ren and Stimpy's workshops,

 

•  How many total attendees were there?

•  How many total workshops were there?

•  How many locations were there?

•  What was the mean number of attendees?

•  What was the median number of attendees?

 

2. Considering Ren and Stimpy's workshops,

 

•  Which workshops had a higher mean number of attendees and what was the mean?  Ren’s or Stimpy’s?

•  Which workshops had a higher mean number of attendees and what was the mean?  Ren North or Ren South?

 

3. Considering Arthur and Buster's workshops,

 

•  How many students attended in total?

•  How many distinct levels of education were reported?

•  What was the median level of education reported?

•  75% of students had what level of education or lower?

•  What was the median education level for females?

•  What was the median education level for males?

•  What was the median education level for females from Elwood County?

•  What was the median education level for males from Elwood County?

 

4. Considering Arthur and Buster's workshops,

 

•  How many students were male?

•  What percentage is this of total students?

•  How many students in the November workshop were male?

•  What percentage is this of total students in the November workshop?

•  How many male students in the November workshop were from Bear Lake County?

 

5.  As part of a nutrition education program, extension educators had students keep diaries of what they ate for a day and then calculated the calories students consumed.  The following data are the result.  Rating indicates the score, from 1 to 5, that students gave as to the usefulness of the program.  It should be considered an ordinal variable.


Student  Teacher  Sex     Calories  Rating
a        Tetsuo   male    2300      3
b        Tetsuo   female  1800      3
c        Tetsuo   male    1900      4
d        Tetsuo   female  1700      5
e        Tetsuo   male    2200      4
f        Tetsuo   female  1600      3
g        Tetsuo   male    1800      3
h        Tetsuo   female  2000      3
i        Kaneda   male    2100      4
j        Kaneda   female  1900      5
k        Kaneda   male    1900      4
l        Kaneda   female  1600      4
m        Kaneda   male    2000      4
n        Kaneda   female  2000      5
o        Kaneda   male    2100      3
p        Kaneda   female  1800      4

 

a.  What are the variables in this data set and what type of variable is each?  (Types are “nominal”, “ordinal”, and “interval/ratio”, not how their types are reported by R.)

 

For each of the following, answer the question, and show the output from the analyses you used to answer the question.

 

b.  How many students were involved in this data set?

 

c.  What was the mean caloric intake?

 

d.  What was the median caloric intake?

 

e.  What was the standard deviation of caloric intake?

 

f.  What was the mean caloric intake for females?


g.  What was the mean caloric intake for females in Kaneda’s class?

 

h.  How many males are in Kaneda’s class?