When developing more complex models, it is often desirable to report a p-value for the model as a whole as well as an R-square for the model.
p-values for models
The p-value for a model determines the significance of the model compared with a null model. For a linear model, the null model is defined as the dependent variable being equal to its mean. So, the p-value for the model is answering the question, Does this model explain the values of the dependent variable significantly better than would just looking at the average value of the dependent variable?
R-squared and pseudo R-squared
The R-squared value is a measure of how well the model explains the data. It is an example of a goodness-of-fit statistic.
R-squared for linear (ordinary least squares) models
In R, models fit with the lm function are linear models fit with ordinary least squares (OLS). For these models, R-squared indicates the proportion of the variability in the dependent variable that is explained by model. That is, an R-squared of 0.60 indicates that 60% of the variability in the dependent variable is explained by the model.
Pseudo R-squared
For many types of models, R-squared is not defined. These include relatively common models like logistic regression and the cumulative link models used in this book. For these models, pseudo R-squared measures can be calculated. A pseudo R-squared is not directly comparable to the R-squared for OLS models. Nor can it be interpreted as the proportion of the variability in the dependent variable that is explained by model. Instead, pseudo R-squared measures are relative measures among similar models indicating how well the model explains the data.
A few common pseudo R-squared measures include: McFadden, Cox and Snell (also referred to as ML), Nagelkerke (also referred to as Cragg and Uhler), Efron, and count.
Some pseudo R-squared measures work better with some types of models. It takes some experience or research to determine which may work best in a given situation. McFadden, Cox and Snell, and Nagelkerke are based on the likelihood estimates of the model. Nagelkerke is the same as the Cox and Snell, except that the value is adjusted upward so that the Nagelkerke has a maximum value of 1. Efron’s pseudo R-squared has the advantage of being based solely on the actual values of the dependent variable and those values predicted by the model. This makes it relatively easy to understand. The count pseudo R-squared is used in cases of a binary predicted value, and simply compares the number of correct responses to the total number of responses.
p-values and R-squared values.
p-values and R-squared values measure different things. The p-value indicates if there is a significant relationship described by the model. Essentially, if there is enough evidence that the model explains the data better than would a null model. The R-squared measures the degree to which the data is explained by the model. It is therefore possible to get a significant p-value with a low R-squared value. This often happens when there are enough data points to supply sufficient evidence that the model is better than the null model (significant p-value), but variability in the dependent variable isn’t explained too well by the model (low R-squared).
Packages used in this chapter
The packages used in this chapter include:
• psych
• lmtest
• boot
• rcompanion
The following commands will install these packages if they are not already installed:
if(!require(psych)){install.packages("psych")}
if(!require(lmtest)){install.packages("lmtest")}
if(!require(boot)){install.packages("boot")}
if(!require(rcompanion)){install.packages("rcompanion")}
Example of model p-value, R-squared, and pseudo R-squared
The following example uses some hypothetical data of a sample of people for which typing speed (Words.per.minute) and age were measured. After plotting the data, we decide to construct a polynomial model with Words.per.minute as the dependent variable and Age and Age2 as the independent variables. Notice in this example that all variables are treated as interval/ratio variables, and that the independent variables are also continuous variables.
The data will first be fit with a linear model with the lm function. Passing this model to the summary function will display the p-value for the model and the R-squared value for the model.
The same data will then be fit with a generalized linear model with the glm function. This type of model allows more latitude in the types of data that can be fit, but in this example, we’ll use the family=gaussian option, which will mimic the model fit with the lm function, though the underlying math is different.
Importantly, the summary of the glm function does not produce a p-value for the model nor an R-squared for the model.
For the model fit with glm, the p-value can be determined with the anova function comparing the fitted model to a null model. The null model is fit with only an intercept term on the right side of the model. As an alternative, the nagelkerke function described below also reports a p-value for the model, using the likelihood ratio test.
There is no R-squared defined for a glm model. Instead, a pseudo R-squared can be calculated. The function nagelkerke produces pseudo R-squared values for a variety of models. It reports three types: McFadden, Cox and Snell, and Nagelkerke. The efronRSquared function reports Efron’s pseudo R-squared for a variety of models.
Note that these models make certain assumptions about the distribution of the underlying population data, but for simplicity, this example will ignore the need to determine if these assumptions are met.
Data = read.table(header=TRUE, stringsAsFactors=TRUE, text="
Age Words.per.minute
12 23
12 32
12 25
13 27
13 30
15 29
15 33
16 37
18 29
22 33
23 37
24 33
25 43
27 35
33 30
42 25
53 22
")
### Check the data frame
library(psych)
headTail(Data)
str(Data)
summary(Data)
Plot the data
plot(Words.per.minute ~ Age,
data = Data,
pch=16,
xlab = "Age",
ylab = "Words per minute")
Prepare data
### Create new variable for the square of Age
Data$Age2 = Data$Age ^ 2
### Double check data frame
library(psych)
headTail(Data)
Age Words.per.minute Age2
1 12 23 144
2 12 32 144
3 12 25 144
4 13 27 169
... ... ... ...
14 27 35 729
15 33 30 1089
16 42 25 1764
17 53 22 2809
Linear model
model = lm (Words.per.minute ~ Age + Age2,
data=Data)
summary(model) ### Shows coefficients,
### p-value for model, and
R-squared
Multiple R-squared: 0.5009, Adjusted R-squared: 0.4296
F-statistic: 7.026 on 2 and 14 DF, p-value: 0.00771
### p-value and (multiple) R-squared value
Simple plot of data and model
For bivariate data, the function plotPredy will plot the data and the predicted line for the model. It also works for polynomial functions, if the order option is changed.
library(rcompanion)
plotPredy(data = Data,
y = Words.per.minute,
x = Age,
x2 = Age2,
model = model,
order = 2,
xlab = "Words per minute",
ylab = "Age")
Generalized linear model
model = glm (Words.per.minute ~ Age + Age2,
data = Data,
family = gaussian)
summary(model) ### Shows coefficients
Calculate p-value for model
In R, the most common way to calculate the p-value for a fitted model is to compare the fitted model to a null model with the anova function. The null model is usually formulated with just a constant on the right side.
null = glm (Words.per.minute ~ 1, ### Create
null model
data = Data, ### with
only a constant on the right side
family = gaussian)
anova(model,
null,
test="Chisq") ###
Tests options "Rao", "LRT",
### "Chisq", "F",
"Cp"
### But some work with only some model
types
Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1 14 243.07
2 16 487.06 -2 -243.99 0.0008882 ***
Calculate pseudo R-squared and p-value for model
An alternative test for the p-value for a fitted model, the nagelkerke function will report the p-value for a model using the likelihood ratio test.
The nagelkerke function also reports the McFadden, Cox and Snell, and Nagelkerke pseudo R-squared values for the model.
library(rcompanion)
nagelkerke(model)
$Pseudo.R.squared.for.model.vs.null
Pseudo.R.squared
McFadden 0.112227
Cox and Snell (ML) 0.500939
Nagelkerke (Cragg and Uhler) 0.501964
$Likelihood.ratio.test
Df.diff LogLik.diff Chisq p.value
-2 -5.9077 11.815 0.0027184
We can obtain Efron’s pseudo R-squared from the efronRsquared function or the accuracy function.
The accuracy function also outputs several other measures of accuracy and error.
efronRSquared(model)
EfronRSquared
0.501
accuracy(model)
$Fit.criteria
Min.max.accuracy MAE MedAE MAPE MSE RMSE NRMSE.mean NRMSE.median
Efron.r.squared CV.prcnt
1 0.902 3.19 2.92 0.106 14.3 3.78 0.123 0.126
0.501 12.3
Likelihood ratio test for p-value for model
The p-value for a model by the likelihood ratio test can also be determined with the lrtest function in the lmtest package.
library(lmtest)
lrtest(model)
#Df LogLik Df Chisq Pr(>Chisq)
1 4 -46.733
2 2 -52.641 -2 11.815 0.002718 **
Simple plot of data and model
library(rcompanion)
plotPredy(data = Data,
y = Words.per.minute,
x = Age,
x2 = Age2,
model = model,
order = 2,
xlab = "Words per minute",
ylab = "Age",
col = "red") ###
line color
Optional analyses: Confidence intervals for R-squared values
It is relatively easy to produce confidence intervals for R-squared values or other parameters from model fitting, such as coefficients for regression terms. This can be accomplished with bootstrapping. Here the boot.ci function from the boot package is used.
The code below is a little complicated, but relatively flexible.
Function can contain any function of interest, as long as it includes an input vector or data frame (input in this case) and an indexing variable (index in this case). Stat is set to produce the actual statistic of interest on which to perform the bootstrap (r.squared from the summary of the lm in this case).
The code Function(Data, 1:n) is there simply to test Function on the data frame Data. In this case, it will produce the output of Function for the first n rows of Data. Since n is defined as the length of the first column in Data, this should return the value for Stat for the whole data frame, if Function is set up correctly.
Data = read.table(header=TRUE, stringsAsFactors=TRUE, text="
Age Words.per.minute
12 23
12 32
12 25
13 27
13 30
15 29
15 33
16 37
18 29
22 33
23 37
24 33
25 43
27 35
33 30
42 25
53 22
")
Data$Age2 = Data$Age ^ 2
### Check the data frame
library(psych)
headTail(Data)
str(Data)
summary(Data)
Confidence intervals for r-squared by bootstrap
library(boot)
Function = function(input, index){
Input = input[index,]
Result = lm (Words.per.minute ~ Age + Age2,
data = Input)
Stat = summary(Result)$r.squared
return(Stat)}
### Test Function
n = length(Data[,1])
Function(Data, 1:n)
[1] 0.5009385
### Produce Stat estimate by bootstrap
Boot = boot(Data,
Function,
R=5000)
mean(Boot$t[,1])
[1] 0.5754582
### Produce confidence interval by
bootstrap
### (Note that values will varies for
each run)
boot.ci(Boot,
conf = 0.95,
type = "perc")
### Options: "norm", "basic", "stud",
"perc", "bca", "all"
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 1000 bootstrap replicates
Intervals :
Level Percentile
95% ( 0.3796, 0.7802 )
Calculations and Intervals on Original Scale
### Other information
Boot
hist(Boot$t[,1])
Confidence intervals for Efron’s pseudo r-squared by bootstrap
The efronRSquared function produces a single statistic. This makes it relatively easy to pass to the boot function.
library(boot)
library(rcompanion)
Function = function(input, index){
Input = input[index,]
Result = lm(Words.per.minute ~ Age + Age2,
data = Input)
Stat = efronRSquared(Result)
return(Stat)}
### Test Function
n = length(Data[,1])
Function(Data, 1:n)
[1] 0.501
### Produce Stat estimate by bootstrap
Boot = boot(Data,
Function,
R=1000)
mean(Boot$t[,1])
[1] 0.575913
### Produce confidence interval by
bootstrap
boot.ci(Boot,
conf = 0.95,
type = "perc")
### Options: "norm",
"basic", "stud", "perc", "bca",
"all"
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 1000 bootstrap replicates
CALL :
boot.ci(boot.out = Boot, conf = 0.95, type = "perc")
Intervals :
Level Percentile
95% ( 0.3930, 0.7639 )
Calculations and Intervals on Original Scale
### Other information
Boot
hist(Boot$t[,1],
col = "darkgray")
Confidence intervals for pseudo r-squared by bootstrap
The nagelkerke function produces a list. The second item in the list is a matrix named
Pseudo.R.squared.for.model.vs.null.
The third element of this matrix is the value for the Nagelkerke pseudo R-squared. So,
nagelkerke(Result, Null)[[2]][3]
yields the value of the Nagelkerke pseudo R-squared.
library(boot)
library(rcompanion)
Function = function(input, index){
Input = input[index,]
Result = glm (Words.per.minute ~ Age + Age2,
data = Input,
family="gaussian")
Null = glm (Words.per.minute ~ 1,
data = Input,
family="gaussian")
Stat = nagelkerke(Result, Null)[[2]][3]
return(Stat)}
### Test Function
n = length(Data[,1])
Function(Data, 1:n)
[1] 0.501964
### Produce Stat estimate by bootstrap
Boot = boot(Data,
Function,
R=1000)
### In this case, even 1000 iterations
can take a while
mean(Boot$t[,1])
[1] 0.5803598
### Produce confidence interval by
bootstrap
boot.ci(Boot,
conf = 0.95,
type = "perc")
### Options: "norm",
"basic", "stud", "perc", "bca",
"all"
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 1000 bootstrap replicates
Intervals :
Level Percentile
95% ( 0.3836, 0.7860 )
Calculations and Intervals on Original Scale
### Other information
Boot
hist(Boot$t[,1],
col = "darkgray")
References
IDRE . 2011. FAQ: What are pseudo R-squareds? UCLA. stats.oarc.ucla.edu/other/mult-pkg/faq/general/faq-what-are-pseudo-r-squareds/.
Kabacoff, R.I. 2017. Bootstrapping. Quick-R. www.statmethods.net/advstats/bootstrapping.html.