
Summary and Analysis of Extension Program Evaluation in R

Salvatore S. Mangiafico

p-values and R-square Values for Models

When developing more complex models, it is often desirable to report a p-value for the model as a whole as well as an R-squared for the model.

 

p-values for models

The p-value for a model determines the significance of the model compared with a null model.  For a linear model, the null model is defined as the dependent variable being equal to its mean.  So, the p-value for the model answers the question: does this model explain the data significantly better than simply using the average value of the dependent variable?
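As a minimal sketch with hypothetical data, an intercept-only model in R fits exactly this null model: its fitted values all equal the mean of the dependent variable.

```r
### Sketch with hypothetical data: an intercept-only model's fitted
###   values are all equal to the mean of the dependent variable

y = c(23, 32, 25, 27, 30)

null.model = lm(y ~ 1)     ### Null model: intercept only

fitted(null.model)         ### Every fitted value equals mean(y)

mean(y)
```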

 

R-squared and pseudo R-squared

The R-squared value is a measure of how well the model explains the data.  It is an example of a goodness-of-fit statistic.

 

R-squared for linear (ordinary least squares) models

In R, models fit with the lm function are linear models fit with ordinary least squares (OLS).  For these models, R-squared indicates the proportion of the variability in the dependent variable that is explained by the model.  That is, an R-squared of 0.60 indicates that 60% of the variability in the dependent variable is explained by the model.
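This proportion can be verified directly.  The following sketch, with hypothetical data, computes R-squared by hand as one minus the ratio of the residual sum of squares to the total sum of squares, and compares it with the value reported by summary.

```r
### Sketch with hypothetical data: R-squared computed by hand
###   as 1 - SS.residual / SS.total

x = c(1, 2, 3, 4, 5, 6)
y = c(2.1, 3.9, 6.2, 7.8, 10.1, 11.9)

fit = lm(y ~ x)

SS.res = sum(residuals(fit) ^ 2)
SS.tot = sum((y - mean(y)) ^ 2)

1 - SS.res / SS.tot        ### Manual R-squared

summary(fit)$r.squared     ### Same value
```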

 

Pseudo R-squared

For many types of models, R-squared is not defined.  These include relatively common models like logistic regression and the cumulative link models used in this book.  For these models, pseudo R-squared measures can be calculated.  A pseudo R-squared is not directly comparable to the R-squared for OLS models.  Nor can it be interpreted as the proportion of the variability in the dependent variable that is explained by the model.  Instead, pseudo R-squared measures are relative measures among similar models indicating how well the model explains the data.

 

This book uses three pseudo R-squared measures: McFadden; Cox and Snell (also referred to as ML); and Nagelkerke (also referred to as Cragg and Uhler).
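For reference, all three can be computed from the log-likelihoods of the fitted and null models.  The sketch below uses a small hypothetical logistic regression; the nagelkerke function in rcompanion reports these same quantities.

```r
### Sketch with hypothetical data: the three pseudo R-squared
###   measures computed from model log-likelihoods

x = c(1, 2, 3, 4, 5, 6, 7, 8)
y = c(0, 0, 0, 1, 0, 1, 1, 1)

model = glm(y ~ x, family = binomial)
null  = glm(y ~ 1, family = binomial)

LL.model = as.numeric(logLik(model))
LL.null  = as.numeric(logLik(null))
n        = length(y)

McFadden   = 1 - LL.model / LL.null

Cox.Snell  = 1 - exp((2 / n) * (LL.null - LL.model))

Nagelkerke = Cox.Snell / (1 - exp((2 / n) * LL.null))
```

Note that the Nagelkerke value is the Cox and Snell value divided by its maximum possible value, which is why it can reach 1 while the Cox and Snell cannot.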

 

In general I favor the Nagelkerke pseudo R-squared, but there is no agreement as to which pseudo R-squared measurement should be used.

 

p-values and R-squared values

p-values and R-squared values measure different things.  The p-value indicates if there is a significant relationship described by the model, and the R-squared measures the degree to which the data is explained by the model.  It is therefore possible to get a significant p-value with a low R-squared value.  This often happens when there is a lot of variability in the dependent variable, but there are enough data points for a significant relationship to be indicated.
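A quick simulation with hypothetical data illustrates this: a weak but real relationship, observed over many data points, yields a small R-squared alongside a highly significant p-value.

```r
### Simulation with hypothetical data: a weak relationship and many
###   observations give a low R-squared but a significant p-value

set.seed(1234)

x = rnorm(1000)
y = 0.3 * x + rnorm(1000)      ### Weak signal, much noise

fit = summary(lm(y ~ x))

fit$r.squared                  ### Low

p.value = pf(fit$fstatistic[1],
             fit$fstatistic[2],
             fit$fstatistic[3],
             lower.tail = FALSE)

p.value                        ### Highly significant
```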

 

Packages used in this chapter

 

The packages used in this chapter include:

•  psych

•  lmtest

•  boot

•  rcompanion

 

The following commands will install these packages if they are not already installed:


if(!require(psych)){install.packages("psych")}
if(!require(lmtest)){install.packages("lmtest")}
if(!require(boot)){install.packages("boot")}
if(!require(rcompanion)){install.packages("rcompanion")}


Example of model p-value, R-squared, and pseudo R-squared

 

The following example uses some hypothetical data of a sample of people for which typing speed (Words.per.minute) and age were measured.  After plotting the data, we decide to construct a polynomial model with Words.per.minute as the dependent variable and Age and Age2 as the independent variables.  Notice in this example that all variables are treated as interval/ratio variables, and that the independent variables are also continuous variables.

 

The data will first be fit with a linear model with the lm function.  Passing this model to the summary function will display the p-value for the model and the R-squared value for the model. 

 

The same data will then be fit with a generalized linear model with the glm function.  This type of model allows more latitude in the types of data that can be fit, but in this example, we’ll use the family=gaussian option, which will mimic the model fit with the lm function, though the underlying math is different.
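A quick check with hypothetical data confirms this equivalence: lm and glm with family = gaussian return the same coefficient estimates.

```r
### Sketch with hypothetical data: glm with family = gaussian
###   gives the same coefficients as lm

x = c(2, 4, 6, 8, 10)
y = c(3.1, 4.9, 7.2, 8.8, 11.1)

model.lm  = lm(y ~ x)
model.glm = glm(y ~ x, family = gaussian)

coef(model.lm)

coef(model.glm)                ### Identical estimates
```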

 

Importantly, the summary of the glm function produces neither a p-value for the model nor an R-squared for the model.

 

For the model fit with glm, the p-value can be determined with the anova function comparing the fitted model to a null model.  The null model is fit with only an intercept term on the right side of the model.  As an alternative, the nagelkerke function described below also reports a p-value for the model, using the likelihood ratio test.

 

There is no R-squared defined for a glm model.  Instead, a pseudo R-squared can be calculated.  The nagelkerke function produces pseudo R-squared values for a variety of models.  It reports three types: McFadden, Cox and Snell, and Nagelkerke.  In general I recommend using the Nagelkerke measure, though there is no agreement on which pseudo R-squared measurement to use, if any at all.

 

The Nagelkerke is the same as the Cox and Snell, except that the value is adjusted upward so that the Nagelkerke has a maximum value of 1. 

 

It has been suggested that a McFadden value of 0.2–0.4 indicates a good fit.

 

Note that these models make certain assumptions about the distribution of the data, but for simplicity, this example will ignore the need to determine whether the data meet these assumptions.

 

Input =("
Age   Words.per.minute
 12    23
 12    32
 12    25
 13    27
 13    30
 15    29
 15    33
 16    37
 18    29
 22    33
 23    37
 24    33
 25    43
 27    35
 33    30
 42    25
 53    22
")

Data = read.table(textConnection(Input),header=TRUE)


###  Check the data frame

library(psych)

headTail(Data)

str(Data)

summary(Data)


### Remove unnecessary objects

rm(Input)


Plot the data


plot(Words.per.minute ~ Age,
     data = Data,
     pch=16,
     xlab = "Age",
     ylab = "Words per minute")


[Scatter plot of Words.per.minute versus Age]


Prepare data


### Create new variable for the square of Age

Data$Age2 = Data$Age ^ 2


### Double check data frame

library(psych)

headTail(Data)


    Age Words.per.minute Age2
1    12               23  144
2    12               32  144
3    12               25  144
4    13               27  169
... ...              ...  ...
14   27               35  729
15   33               30 1089
16   42               25 1764
17   53               22 2809


Linear model


model = lm (Words.per.minute ~ Age + Age2,
                  data=Data)


summary(model)       ### Shows coefficients,
                     ###   p-value for model, and R-squared


Multiple R-squared:  0.5009,  Adjusted R-squared:  0.4296
F-statistic: 7.026 on 2 and 14 DF,  p-value: 0.00771

   ### p-value and (multiple) R-squared value


Simple plot of data and model

For bivariate data, the function plotPredy will plot the data and the predicted line for the model.  It also works for polynomial functions, if the order option is changed.


library(rcompanion)

plotPredy(data  = Data,
          y     = Words.per.minute,
          x     = Age,
          x2    = Age2,
          model = model,
          order = 2,
          xlab  = "Age",
          ylab  = "Words per minute")


[Plot of the data with the fitted second-order polynomial curve]


Generalized linear model


model = glm (Words.per.minute ~ Age + Age2,
             data = Data,
             family = gaussian)


summary(model)       ### Shows coefficients


Calculate p-value for model

In R, the most common way to calculate the p-value for a fitted model is to compare the fitted model to a null model with the anova function.  The null model is usually formulated with just a constant on the right side.


null = glm (Words.per.minute ~ 1,    ### Create null model
             data = Data,            ###   with only a constant on the right side
             family = gaussian)

anova (model,
       null,
       test="Chisq")                 ### Test options: "Rao", "LRT",
                                     ###   "Chisq", "F", "Cp"
                                     ### But some work with only some model types


  Resid. Df Resid. Dev Df Deviance  Pr(>Chi)   
1        14     243.07                         
2        16     487.06 -2  -243.99 0.0008882 ***


Calculate pseudo R-squared and p-value for model

As an alternative test for the p-value for a fitted model, the nagelkerke function will report the p-value for the model using the likelihood ratio test.

 

The nagelkerke function also reports the McFadden, Cox and Snell, and Nagelkerke pseudo R-squared values for the model.


library(rcompanion)

nagelkerke(model)


$Pseudo.R.squared.for.model.vs.null

                             Pseudo.R.squared
McFadden                             0.112227
Cox and Snell (ML)                   0.500939
Nagelkerke (Cragg and Uhler)         0.501964


$Likelihood.ratio.test

 Df.diff LogLik.diff  Chisq   p.value
      -2     -5.9077 11.815 0.0027184


Likelihood ratio test for p-value for model

The p-value for a model by the likelihood ratio test can also be determined with the lrtest function in the lmtest package.


library(lmtest)

lrtest(model)


  #Df  LogLik Df  Chisq Pr(>Chisq)  
1   4 -46.733                       
2   2 -52.641 -2 11.815   0.002718 **


Simple plot of data and model


library(rcompanion)

plotPredy(data  = Data,
          y     = Words.per.minute,
          x     = Age,
          x2    = Age2,
          model = model,
          order = 2,
          xlab  = "Age",
          ylab  = "Words per minute",
          col   = "red")                ###  line color


[Plot of the data with the fitted curve from the glm model, in red]


Optional analyses:  Confidence intervals for p-values and R-squared values

 

It is relatively easy to produce confidence intervals for p-values, R-squared values, or other parameters from model fitting, such as coefficients for regression terms.  This can be accomplished with bootstrapping.  Here the boot.ci function from the boot package is used.
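The idea behind the bootstrap can be sketched in base R without the boot package, using hypothetical data: resample the rows with replacement, recompute the statistic each time, and take percentiles of the resampled statistics as the confidence interval.  The boot package automates and extends this procedure.

```r
### Base-R sketch of the percentile bootstrap with hypothetical
###   data, without the boot package

set.seed(4321)

x = rnorm(30)
y = x + rnorm(30)

R = 1000

Stat = replicate(R, {
         i = sample(seq_along(x), replace = TRUE)   ### Resample rows
         summary(lm(y[i] ~ x[i]))$r.squared         ### Recompute statistic
       })

quantile(Stat, c(0.025, 0.975))    ### Percentile confidence interval
```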

 

The code below is a little complicated, but relatively flexible.

 

Function can contain any function of interest, as long as it includes an input vector or data frame (input in this case) and an indexing variable (index in this case).  Stat is set to produce the actual statistic of interest on which to perform the bootstrap (r.squared from the summary of the lm in this case).

 

The code Function(Data, 1:n) is there simply to test Function on the data frame Data.  In this case, it will produce the output of Function for the first n rows of Data.  Since n is defined as the length of the first column in Data, this should return the value for Stat for the whole data frame, if Function is set up correctly.


Input =("
Age   Words.per.minute
 12    23
 12    32
 12    25
 13    27
 13    30
 15    29
 15    33
 16    37
 18    29
 22    33
 23    37
 24    33
 25    43
 27    35
 33    30
 42    25
 53    22
")

Data = read.table(textConnection(Input),header=TRUE)

Data$Age2 = Data$Age ^ 2


###  Check the data frame


library(psych)

headTail(Data)

str(Data)

summary(Data)


### Remove unnecessary objects

rm(Input)


Linear model

 

R-squared value


library(boot)

Function = function(input, index){
                    Input = input[index,]
                    Result = lm (Words.per.minute ~ Age + Age2,
                                          data = Input)
                    Stat = summary(Result)$r.squared
                    return(Stat)}


### Test Function

n = length(Data[,1])

Function(Data, 1:n)


[1] 0.5009385


### Produce Stat estimate by bootstrap

Boot = boot(Data,
             Function,
             R=5000)
            
mean(Boot$t[,1])


[1] 0.5754582


### Produce confidence interval by bootstrap

boot.ci(Boot,
        conf = 0.95,
        type = "perc")

            ### Options: "norm", "basic", "stud", "perc", "bca", "all"


BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 5000 bootstrap replicates

Intervals :
Level     Percentile    
95%   ( 0.3796,  0.7802 ) 
Calculations and Intervals on Original Scale


### Other information

Boot

hist(Boot$t[,1],
     col = "darkgray")


p-value

Unfortunately, the p-value for the model is not stored neatly in the lm model object, so it has to be calculated in the Stat definition.


library(boot)

Function = function(input, index){
                    Input = input[index,]
                    Result = lm (Words.per.minute ~ Age + Age2,
                                          data = Input)
                    Stat = pf(summary(Result)$fstatistic[1],
                              summary(Result)$fstatistic[2],
                              summary(Result)$fstatistic[3],
                              lower.tail = FALSE)
                    return(Stat)}


### Test Function

n = length(Data[,1])

Function(Data, 1:n)


      value
0.007710425


### Produce Stat estimate by bootstrap

Boot = boot(Data,
             Function,
             R=5000)
            
mean(Boot$t[,1])


[1] 0.00672866


### Produce confidence interval by bootstrap

boot.ci(Boot,
        conf = 0.95,
        type = "perc")

            ### Options: "norm", "basic", "stud", "perc", "bca", "all"


BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 5000 bootstrap replicates

Intervals :
Level     Percentile    
95%   ( 0.0000,  0.0368 ) 
Calculations and Intervals on Original Scale


### Other information

Boot

hist(Boot$t[,1],
     col = "darkgray")


Generalized linear model

 

Pseudo R-squared value

The nagelkerke function produces a list.  The second item in the list is a matrix named Pseudo.R.squared.for.model.vs.null.  The third element of this matrix is the value for the Nagelkerke pseudo R-squared.  So, nagelkerke(Result, Null)[[2]][3] yields the value of the Nagelkerke pseudo R-squared.


library(boot)

library(rcompanion)

Function = function(input, index){
                    Input = input[index,]
                    Result = glm (Words.per.minute ~ Age + Age2,
                                          data = Input,
                                          family="gaussian")
                    Null = glm (Words.per.minute ~ 1,
                                          data = Input,
                                          family="gaussian")
                    Stat = nagelkerke(Result, Null)[[2]][3]
                    return(Stat)}


### Test Function

n = length(Data[,1])

Function(Data, 1:n)


[1] 0.501964


### Produce Stat estimate by bootstrap

Boot = boot(Data,
             Function,
             R=1000)

                 ### In this case, even 1000 iterations can take a while
            
mean(Boot$t[,1])            


[1] 0.5803598


### Produce confidence interval by bootstrap

boot.ci(Boot,
        conf = 0.95,
        type = "perc")

            ### Options: "norm", "basic", "stud", "perc", "bca", "all"


BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 1000 bootstrap replicates

Intervals :
Level     Percentile    
95%   ( 0.3836,  0.7860 ) 
Calculations and Intervals on Original Scale


### Other information

Boot

hist(Boot$t[,1],
     col = "darkgray")


p-value


library(boot)

library(rcompanion)

Function = function(input, index){
                    Input = input[index,]
                    Result = glm (Words.per.minute ~ Age + Age2,
                                          data = Input,
                                          family="gaussian")
                    Null = glm (Words.per.minute ~ 1,
                                          data = Input,
                                          family="gaussian")
                    Stat = nagelkerke(Result, Null)[[3]][1,4]
                    return(Stat)}


### Test Function

n = length(Data[,1])

Function(Data, 1:n)


[1] 0.0027184


### Produce Stat estimate by bootstrap

Boot = boot(Data,
             Function,
             R=1000)

                 ### In this case, even 1000 iterations can take a while

mean(Boot$t[,1])


[1] 0.002723259


### Produce confidence interval by bootstrap

boot.ci(Boot,
        conf = 0.95,
        type = "perc")

            ### Options: "norm", "basic", "stud", "perc", "bca", "all"


BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 1000 bootstrap replicates

Intervals :
Level     Percentile    
95%   ( 0.0000,  0.0173 ) 
Calculations and Intervals on Original Scale


### Other information

Boot

hist(Boot$t[,1],
     col = "darkgray")

