[banner]

Summary and Analysis of Extension Program Evaluation in R

Salvatore S. Mangiafico

Advanced Parametric Methods

Packages used in this chapter


The packages used in this chapter include:

•  robustbase

•  psych

•  minpack.lm

•  car

•  rcompanion

 

The following commands will install these packages if they are not already installed:


if(!require(robustbase)){install.packages("robustbase")}
if(!require(pysch)){install.packages("pysch")}
if(!require(nlstools)){install.packages("nlstools")}
if(!require(minpack.lm)){install.packages("minpack.lm")}
if(!require(car)){install.packages("car")}
if(!require(rcompanion)){install.packages("rcompanion")}


More complex experimental designs

 

Except for t-tests, the approach of this book for parametric statistics has been to develop linear models (with the lm function) or mixed effects models (with the nlme or lme4 packages) and then to apply analysis of variance, model checking, and post-hoc testing.

 

These relatively simple models can be expanded to include more independent variables of either continuous or factor type.  Likewise, more complex designs can include nested and crossed factors.  Some traditional experimental designs include Latin square, split plot, and incomplete block.

 

Analysis of co-variance.

 

Analysis of covariance combines continuous independent variables with factor independent variables. 

 

A related design is the paired watershed design which, while it was developed for watershed studies, can be adapted to various situations in a variety of fields.

 

Analysis of Covariance in Mangiafico, S.S. 2015. An R Companion for the Handbook of Biological Statistics, version 1.09. rcompanion.org/rcompanion/e_04.html.

 

Clausen, J.C. and J. Spooner. 1993. Paired Watershed Study Design. 841-F-93-009. United States Environmental Protection Agency, Office of Water. Washington, DC. nepis.epa.gov/Exe/ZyPURL.cgi?Dockey=20004PR6.TXT.

 

Dressing, S.A. and D.W. Meals. 2005. Designing water quality monitoring programs for watershed projects. U.S. Environmental Protection Agency. Fairfax, VA. www.epa.gov/sites/production/files/2015-10/documents/technote2_wq_monitoring.pdf.

 

Nonlinear regression and curvilinear regression

 

Nonlinear and curvilinear regression are similar to linear regression, except that the relationship between the dependent variable and the independent variable is curved in some way.  Specific methods include polynomial regression, spline regression, and nonlinear regression.

 

Polynomial regression was covered briefly in the previous chapter, while some examples of curvilinear regression are shown below in the “Linear plateau and quadratic plateau models” section in this chapter. 

 

For further discussion, see:

 

Curvilinear Regression in Mangiafico, S.S. 2015. An R Companion for the Handbook of Biological Statistics, version 1.09. rcompanion.org/rcompanion/e_03.html.

 

Multiple regression

 

Multiple regression is similar to linear regression, except that the model contains multiple independent variables.

 

The polynomial regression example in the previous chapter is one example of multiple regression. 

 

For further discussion, see:

 

Multiple Regression in Mangiafico, S.S. 2015. An R Companion for the Handbook of Biological Statistics, version 1.09. rcompanion.org/rcompanion/e_05.html.

 

Logistic regression

 

Logistic regression is similar to linear regression, except that the dependent variable is categorical.  In standard logistic regression, the dependent variable has only two levels, and in multinomial logistic regression, the dependent variable can have more than two levels.

 

For examples of logistic regression, see the chapter Models for Nominal Data in this book.

 

Analysis of count data

 

When the dependent variable is a counted quantity and not a measured quantity, it is appropriate to use Poisson regression or related techniques.

 

For a discussion of when standard parametric approaches may or may not be used with count data, see the “Count data may not be appropriate for common parametric tests” section in the Introduction to Parametric Tests chapter.

 

For examples of appropriate ways to model count data, see the Hermite and Poisson Regression for Count Data chapter in this book.

 

Robust techniques

 

Standard parametric analyses can be sensitive to outliers and other deviations from assumptions of the analyses.  The example below shows that a single point can affect the predicted line from standard linear regression.  It is said that this point has high leverage, as so has high influence in the model.

 

One approach in these cases is to use nonparametric techniques. 

 

Another approach is to adjust the model to better fit the data.

 

A third way is to determine if certain data points are true outliers, and then give them further examination. 

 

A fourth way is to use a robust technique such as robust regression.  The references below discuss robust regression and robust techniques appropriate for designs lending themselves to an analysis of variance approach.

 

One-way Analysis with Permutation Test in Mangiafico, S.S. 2015. An R Companion for the Handbook of Biological Statistics, version 1.09. rcompanion.org/rcompanion/d_06a.html.

 

Two-way Anova with Robust Estimation in Mangiafico, S.S. 2015. An R Companion for the Handbook of Biological Statistics, version 1.09. rcompanion.org/rcompanion/d_08a.html.

 

Search for Robust regression in the following:

Correlation and Linear Regression in Mangiafico, S.S. 2015. An R Companion for the Handbook of Biological Statistics, version 1.09. rcompanion.org/rcompanion/e_01.html.

 

Example of robust linear regression

Note that this example does not include checking assumptions of the model, but is intended only to compare the results of ordinary linear regression and robust regression when there is a data point with high influence.


Input=("
X    Y
0    0
1    2
2    5
3    6
4    9
5   10
6   11
7   14
8   16
9   20
10  44
")

Data = read.table(textConnection(Input),header=TRUE)

Linear regression


model = lm(Y ~ X,
           data = Data)

summary(model)


Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)  -3.1364     3.6614  -0.857 0.413888   
X             3.1182     0.6189   5.038 0.000701 ***

Residual standard error: 6.491 on 9 degrees of freedom
Multiple R-squared:  0.7383,  Adjusted R-squared:  0.7092
F-statistic: 25.39 on 1 and 9 DF,  p-value: 0.0007013


Calculate statistics and plot


Pvalue = pf(summary(model)$fstatistic[1],
            summary(model)$fstatistic[2],
            summary(model)$fstatistic[3],
            lower.tail = FALSE)

R2     = summary(model)$r.squared


t1     = paste0("p-value: ", signif(Pvalue, digits=3))
t2     = paste0("R-squared: ", signif(R2, digits=3))
t3     = paste0("Intercept: ", signif(coef(model)[1], digits=3))
t4     = paste0("Slope: ", signif(coef(model)[2], digits=3))

plot(Y ~ X,
     data = Data,
     pch  = 16)

abline(model,
       col="blue",
       lwd=2)

text(0, 38, labels = t1, pos=4)
text(0, 33, labels = t2, pos=4)
text(0, 28, labels = t3, pos=4)
text(0, 23, labels = t4, pos=4)





Cook’s distance

Cook’s distance is a measure of influence of each data point, specifically determining how much predicted values would change if the observation were deleted.  It is sometimes recommended that a Cook’s distance of 1 or more merits further examination of the data point.  However, there are other recommendations for critical values for Cook’s distance.

 

In this case, the high Cook’s distance for observation 11 makes it a candidate for further evaluation.


barplot(cooks.distance(model))





Plot of residuals

This plot suggests that the residuals are not independent of the fitted values.  We would probably want to adjust the model by adding additional terms, using a nonlinear or curvilinear approach, or using a robust analysis to minimize the influence of certain data points.


plot(fitted(model),
     residuals(model))





Robust regression


library(robustbase)

model.r = lmrob(Y ~ X,
                data = Data)

summary(model.r)


Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)  0.09015    0.32804   0.275     0.79   
X            2.04249    0.10970  18.619  1.7e-08 ***

Robust residual standard error: 0.9378
Multiple R-squared:  0.9817,  Adjusted R-squared:  0.9796


Calculate statistics and plot


model.null = lmrob(Y ~ 1,
                   data = Data)

Pvalue = anova(model.r, model.null)[4][2,1]

R2     = summary(model.r)$r.squared

t1     = paste0("p-value: ", signif(Pvalue, digits=3))
t2     = paste0("R-squared: ", signif(R2, digits=3))
t3     = paste0("Intercept: ", signif(coefficients(model.r)[1], digits=3))
t4     = paste0("Slope: ", signif(coefficients(model.r)[2], digits=3))

plot(Y ~ X,
     data = Data,
     pch  = 16)
abline(model.r,
       col="blue",
       lwd=2)
text(0, 38, labels = t1, pos=4)
text(0, 33, labels = t2, pos=4)
text(0, 28, labels = t3, pos=4)
text(0, 23, labels = t4, pos=4)





Kendall Theil nonparametric linear regression

 

Kendall–Theil regression is a completely nonparametric approach to linear regression.  It is robust to outliers in the y values.  It simply computes all the lines between each pair of points, and uses the median of the slopes of these lines.  This method is sometimes called Theil–Sen.  A modified, and preferred, method is named after Siegel.

 

The mblm function in the mblm package uses the Siegel method by default.  The Theil–Sen procedure can be chosen with the repeated=FALSE option. See library(mblm); ?mblm for more details.


Input=("
X    Y
0    0
1    2
2    5
3    6
4    9
5   10
6   11
7   14
8   16
9   20
10  44
")

Data = read.table(textConnection(Input),header=TRUE)


library(mblm)

model = mblm(Y ~ X,
             data=Data)

summary(model)


Coefficients:
            Estimate MAD V value Pr(>|V|)  
(Intercept)        0   0       6  0.78741  
X                  2   0      66  0.00327 **


Pvalue    = as.numeric(summary(model)$coefficients[2,4])
Intercept = as.numeric(summary(model)$coefficients[1,1])
Slope     = as.numeric(summary(model)$coefficients[2,1])
R2        = NULL

t1     = paste0("p-value: ", signif(Pvalue, digits=3))
t2     = paste0("R-squared: ", "NULL")
t3     = paste0("Intercept: ", signif(Intercept, digits=3))
t4     = paste0("Slope: ", signif(Slope, digits=3))

plot(Y ~ X,
     data = Data,
     pch  = 16)

abline(model,
       col="blue",
       lwd=2)

text(1, 40, labels = t1, pos=4)
text(1, 36, labels = t2, pos=4)
text(1, 32, labels = t3, pos=4)
text(1, 28, labels = t4, pos=4)





 

Contrasts in Linear Models

 

Contrasts can be used in linear models to compare groups of treatments to one another.  For example, if you were testing four curricula and wanted to compare the effect of One and Two vs. Three and Four, you could accomplish this with a single-degree-of freedom contrast.

 

Contrasts in Linear Models in Mangiafico, S.S. 2015. An R Companion for the Handbook of Biological Statistics, version 1.09. rcompanion.org/rcompanion/h_01.html.

 

Linear plateau and quadratic plateau models

 

Linear plateau and quadratic plateau models are segmented models for bivariate data, typically where the y-value increases with an increase in x to some critical value, beyond which the y-value ceases to increase.  The statistics of interest include the critical value (the x-value above which there is no further increase in y), and the plateau value (the statistically highest value that y reaches).

 

For another example of these models, see the “Use the best model” section of the Basic Plots chapter.

 

Also included in this section is an example of curvilinear regression, with a Mitscherlich–Bray model fit to the same data.

 

Examples of linear plateau and quadratic plateau models


Input = ("
Instructor       Grade   Weight  Calories Sodium  Score
'Brendon Small'     6      43     2069    1287      77
'Brendon Small'     6      41     1990    1164      76
'Brendon Small'     6      40     1975    1177      76
'Brendon Small'     6      44     2116    1262      84
'Brendon Small'     6      45     2161    1271      86
'Brendon Small'     6      44     2091    1222      87
'Brendon Small'     6      48     2236    1377      90
'Brendon Small'     6      47     2198    1288      78
'Brendon Small'     6      46     2190    1284      89
'Jason Penopolis'   7      45     2134    1262      76
'Jason Penopolis'   7      45     2128    1281      80
'Jason Penopolis'   7      46     2190    1305      84
'Jason Penopolis'   7      43     2070    1199      68
'Jason Penopolis'   7      48     2266    1368      85
'Jason Penopolis'   7      47     2216    1340      76
'Jason Penopolis'   7      47     2203    1273      69
'Jason Penopolis'   7      43     2040    1277      86
'Jason Penopolis'   7      48     2248    1329      81
'Melissa Robins'    8      48     2265    1361      67
'Melissa Robins'    8      46     2184    1268      68
'Melissa Robins'    8      53     2441    1380      66
'Melissa Robins'    8      48     2234    1386      65
'Melissa Robins'    8      52     2403    1408      70
'Melissa Robins'    8      53     2438    1380      83
'Melissa Robins'    8      52     2360    1378      74
'Melissa Robins'    8      51     2344    1413      65
'Melissa Robins'    8      51     2351    1400      68
'Paula Small'       9      52     2390    1412      78
'Paula Small'       9      54     2470    1422      62
'Paula Small'       9      49     2280    1382      61
'Paula Small'       9      50     2308    1410      72
'Paula Small'       9      55     2505    1410      80
'Paula Small'       9      52     2409    1382      60
'Paula Small'       9      53     2431    1422      70
'Paula Small'       9      56     2523    1388      79
'Paula Small'       9      50     2315    1404      71
'Coach McGuirk'    10      52     2406    1420      68
'Coach McGuirk'    10      58     2699    1405      65
'Coach McGuirk'    10      57     2571    1400      64
'Coach McGuirk'    10      52     2394    1420      69
'Coach McGuirk'    10      55     2518    1379      70
'Coach McGuirk'    10      52     2379    1393      61
'Coach McGuirk'    10      59     2636    1417      70
'Coach McGuirk'    10      54     2465    1414      59
'Coach McGuirk'    10      54     2479    1383      61
")

Data = read.table(textConnection(Input),header=TRUE)

###  Order factors by the order in data frame
###  Otherwise, R will alphabetize them


Data$Instructor = factor(Data$Instructor,
                         levels=unique(Data$Instructor))


###  Check the data frame

library(psych)

headTail(Data)

str(Data)

summary(Data)


### Remove unnecessary objects

rm(Input)


Linear plateau model

The nls function in the native stats package can fit nonlinear and curvilinear functions.  To accomplish this, a function—linplat, here—will be defined with the x and y variables (Calories and Sodium) along with parameters (a, b, and clx).  It is the goal of the nls function to find the best fit values for these parameters for these data.

 

The code below starts with an attempt to find reasonable starting values for the parameters.  If these starting values don’t yield appropriate results, other starting values can be tried.

 

The fitted parameters a, b, and clx designate the best fit intercept, slope, and critical x value.  The plateau value is calculated as a + b * clx.


###  Find reasonable initial values for parameters

fit.lm    = lm(Sodium ~ Calories, data=Data)

a.ini     = fit.lm$coefficients[1]
b.ini     = fit.lm$coefficients[2]
clx.ini   = mean(Data$Calories)


###  Define linear plateau function

linplat = function(x, a, b, clx)
          {ifelse(x < clx, a + b * x,
                           a + b * clx)}


###  Find best fit parameters

model = nls(Sodium ~ linplat(Calories, a, b, clx),
            data = Data,
            start = list(a   = a.ini,
                         b   = b.ini,
                         clx = clx.ini),
             trace = FALSE,
             nls.control(maxiter = 1000))

summary(model)


Parameters:
      Estimate Std. Error t value Pr(>|t|)   
a   -128.85535  116.10819   -1.11    0.273   
b      0.65768    0.05343   12.31  1.6e-15 ***
clx 2326.51521   16.82086  138.31  < 2e-16 ***


p-value and pseudo R-squared


###  Define null model

nullfunct = function(x, m){m}

m.ini    = mean(Data$Sodium)

null = nls(Sodium ~ nullfunct(Calories, m),
           data = Data,
           start = list(m   = m.ini),
           trace = FALSE,
           nls.control(maxiter = 1000))


###  Find p-value and pseudo R-squared

library(rcompanion)

nagelkerke(model,
           null)


$Pseudo.R.squared.for.model.vs.null
                             Pseudo.R.squared
McFadden                             0.195417
Cox and Snell (ML)                   0.892004
Nagelkerke (Cragg and Uhler)         0.892014

$Likelihood.ratio.test
 Df.diff LogLik.diff  Chisq   p.value
      -2     -50.077 100.15 1.785e-22


Determine confidence intervals for parameter estimates


library(nlstools)

confint2(model,
         level = 0.95)


           2.5 %       97.5 %
a   -363.1711583  105.4604611
b      0.5498523    0.7654999
clx 2292.5693351 2360.4610904


library(nlstools)

Boot = nlsBoot(model)

summary(Boot)


------
Bootstrap statistics
        Estimate   Std. error
a   -143.5189740 111.18292997
b      0.6643267   0.05132983
clx 2326.8124177  16.71482729

------
Median of bootstrap estimates and percentile confidence intervals
          Median         2.5%       97.5%
a   -143.5474796 -376.3224461   73.437548
b      0.6638483    0.5652069    0.771089
clx 2327.1279329 2292.5244830 2363.929178


Plot data with best fit line


library(rcompanion)

plotPredy(data  = Data,
          x     = "Calories",
          y     = "Sodium",
          model = model,
          xlab  = "Calories per day",
          ylab  = "Sodium intake per day")




A plot of daily sodium intake vs. daily caloric intake for students.  Best fit linear plateau model is shown.  Critical x value = 2327 calories, and Plateau y = 1401 mg. p < 0.0001.  Pseudo R-squared (Nagelkerke) = 0.892.


Plots of residuals


x = residuals(model)


library(rcompanion)

plotNormalHistogram(x)





plot(fitted(model),
     residuals(model))





Quadratic plateau model

A quadratic plateau model is similar to a linear plateau model, except that the linear segment is replaced with a quadratic function.

 

The fitted parameters a, b, and clx designate the best fit intercept, linear coefficient, and critical x value.  The quadratic coefficient is calculated as –0.5 * b / clx.  The plateau value is calculated as a + b * clx –0.5 * b * clx.


###  Find reasonable initial values for parameters

fit.lm    = lm(Sodium ~ Calories, data=Data)

a.ini     = fit.lm$coefficients[1]
b.ini     = fit.lm$coefficients[2]
clx.ini   = mean(Data$Calories)


###  Define quadratic plateau function

quadplat = function(x, a, b, clx) {
           ifelse(x  < clx, a + b * x   + (-0.5*b/clx) * x   * x,
                            a + b * clx + (-0.5*b/clx) * clx * clx)}


###  Find best fit parameters


model = nls(Sodium ~ quadplat(Calories, a, b, clx),
            data = Data,
            start = list(a   = a.ini,
                         b   = b.ini,
                         clx = clx.ini),
             trace = FALSE,
             nls.control(maxiter = 1000))

summary(model)


Parameters:
      Estimate Std. Error t value Pr(>|t|)   
a   -4217.7196   935.1139  -4.510 5.13e-05 ***
b       4.4920     0.8251   5.444 2.49e-06 ***
clx  2504.4059    48.9679  51.144  < 2e-16 ***


p-value and pseudo R-squared


###  Define null model

nullfunct = function(x, m){m}

m.ini    = mean(Data$Sodium)

null = nls(Sodium ~ nullfunct(Calories, m),
           data = Data,
           start = list(m   = m.ini),
           trace = FALSE,
           nls.control(maxiter = 1000))


###  Find p-value and pseudo R-squared

library(rcompanion)

nagelkerke(model,
           null)


$Pseudo.R.squared.for.model.vs.null
                             Pseudo.R.squared
McFadden                             0.175609
Cox and Snell (ML)                   0.864674
Nagelkerke (Cragg and Uhler)         0.864683

$Likelihood.ratio.test
 Df.diff LogLik.diff  Chisq    p.value
      -2     -45.001 90.003 2.8583e-20


Determine confidence intervals for parameter estimates


library(nlstools)

confint2(model,
         level = 0.95)


           2.5 %       97.5 %
a   -6104.855900 -2330.583368
b       2.826818     6.157185
clx  2405.584577  2603.227200


library(nlstools)

Boot = nlsBoot(model)

summary(Boot)


------
Bootstrap statistics
        Estimate  Std. error
a   -4294.518746 956.6204011
b       4.560459   0.8448001
clx  2510.154733  53.3134728

------
Median of bootstrap estimates and percentile confidence intervals
          Median        2.5%        97.5%
a   -4250.030936 -6311.47965 -2546.971334
b       4.515874     3.03532     6.359601
clx  2504.795426  2425.75206  2626.813217


Plot data with best fit line


library(rcompanion)

plotPredy(data  = Data,
          x     = "Calories",
          y     = "Sodium",
          model = model,
          xlab  = "Calories per day",
          ylab  = "Sodium intake per day")




A plot of daily sodium intake vs. daily caloric intake for students.  Best fit quadratic plateau model is shown.  Critical x value = 2504 calories, and Plateau y = 1403 mg. p < 0.0001.  Pseudo R-squared (Nagelkerke) = 0.865.


Plots of residuals


x = residuals(model)


library(rcompanion)

plotNormalHistogram(x)





plot(fitted(model),
     residuals(model))





Mitscherlich–Bray  model

This model is an example of curvilinear regression, specifically with a general exponential model.  The model here is specified as y = abecx.

 

In the function below, the constant D is set to the minimum of Calories + 1.  This will be subtracted from the x values because the numerically large values for Calories make the fitting of the exponential function difficult without doing this subtraction.


###  Find reasonable initial values for parameters

a.ini   = 1400
b.ini   = -120
c.ini   = 0.01

###  Set a constant to adjust x values

D       = 1974

###  Define exponential function

expon = function(x, a, b, c) {
                 a + b * exp(-1 * c * (x - D))}


###  Find best fit parameters


model = nls(Sodium ~ expon(Calories, a, b, c),
            data = Data,
            start = list(a = a.ini,
                         b = b.ini,
                         c = c.ini),
             trace = FALSE,
             nls.control(maxiter = 10000))

summary(model)


Parameters:
    Estimate Std. Error t value Pr(>|t|)   
a  1.450e+03  2.309e+01  62.790  < 2e-16 ***
b -2.966e+02  2.142e+01 -13.846  < 2e-16 ***
c  3.807e-03  7.973e-04   4.776  2.2e-05 ***


p-value and pseudo R-squared


###  Define null model

nullfunct = function(x, m){m}

m.ini    = mean(Data$Sodium)

null = nls(Sodium ~ nullfunct(Calories, m),
           data = Data,
           start = list(m   = m.ini),
           trace = FALSE,
           nls.control(maxiter = 1000))


###  Find p-value and pseudo R-squared

library(rcompanion)

nagelkerke(model,
           null)


$Pseudo.R.squared.for.model.vs.null
                             Pseudo.R.squared
McFadden                             0.162515
Cox and Snell (ML)                   0.842908
Nagelkerke (Cragg and Uhler)         0.842918

$Likelihood.ratio.test
 Df.diff LogLik.diff  Chisq    p.value
      -2     -41.646 83.292 8.1931e-19


Determine confidence intervals for parameter estimates


library(nlstools)

confint2(model,
         level = 0.95)


          2.5 %        97.5 %
a  1.403060e+03  1.496244e+03
b -3.398171e+02 -2.533587e+02
c  2.198512e-03  5.416355e-03


library(nlstools)

Boot = nlsBoot(model)

summary(Boot)


------
Bootstrap statistics
       Estimate   Std. error
a  1.453109e+03 2.522906e+01
b -3.005815e+02 2.244473e+01
c  3.850955e-03 7.832523e-04

------
Median of bootstrap estimates and percentile confidence intervals
         Median          2.5%         97.5%
a  1.449234e+03  1.415464e+03  1.513570e+03
b -2.998719e+02 -3.474264e+02 -2.566514e+02
c  3.827519e-03  2.390379e-03  5.410459e-03


Plot data with best fit line


library(rcompanion)

plotPredy(data  = Data,
          x     = "Calories",
          y     = "Sodium",
          model = model,
          xlab  = "Calories per day",
          ylab  = "Sodium intake per day")




A plot of daily sodium intake vs. daily caloric intake for students.  Best fit exponential model is shown.  p < 0.0001.  Pseudo R-squared (Nagelkerke) = 0.843.


Plots of residuals


x = residuals(model)

library(rcompanion)

plotNormalHistogram(x)





plot(fitted(model),
     residuals(model))





Better functions to fit nonlinear regression

Finding reasonable starting values for some functions can be difficult, and the nls function may not find a solution for certain functions if starting values are not close to their best values.  One issue is that the Gauss–Newton method that nls uses will fail in some situations. 

 

Typically, I will put my data into a spreadsheet and make a plot of the function I am trying to fit.  I can then change parameter values and watch the curve change until I get the curve reasonably close to the data.

 

The package minpack.lm has a function nlsLM, which uses the Levenberg–Marquardt method.  It is more successful at fitting parameters for difficult functions when the initial values for parameters are poor.  An example of using this package is shown in the “Fitting curvilinear models with the minpack.lm package” below.

 

(The package nlmrt uses the Marquardt–Nash method, which is also tolerant of difficult functions and poor starting values for parameters.  However, at the time of writing, this package is not available for newer versions of R.)

 
Fitting curvilinear models with the minpack.lm package

For a plot of the best fit line and plots of residuals, see the previous section of this chapter.


library(minpack.lm)


###  Define exponential function

expon = function(x, a, b, c) {
                 a + b * exp(-1 * c * (x - D))}


###  Find best fit parameters


model = nlsLM(Sodium ~ expon(Calories, a, b, c),
            data = Data,
            start = list(a = 1400,
                         b = -1,
                         c = 0.1))

summary(model)


Parameters:
    Estimate Std. Error t value Pr(>|t|)   
a  1.450e+03  2.309e+01  62.789  < 2e-16 ***
b -2.966e+02  2.142e+01 -13.846  < 2e-16 ***
c  3.807e-03  7.972e-04   4.776  2.2e-05 ***


Cate–Nelson analysis

 

Cate–Nelson analysis is a useful approach for some bivariate data that don’t fit neatly into linear model, linear–plateau model, or curvilinear model.  The analysis divides the data into two populations: those x values that are likely to have high y values, and those x values that are likely to have low y values.

 

For more discussion and examples, see:

Cate–Nelson Analysis in Mangiafico, S.S. 2015. An R Companion for the Handbook of Biological Statistics, version 1.09. rcompanion.org/rcompanion/h_02.html.



library(rcompanion)

cateNelson(x          = Data$Calories,
           y          = Data$Sodium,
           plotit     = TRUE,
           hollow     = TRUE,
           xlab       = "Calories per day",
           ylab       = "Sodium intake, mg per day",
           trend      = "positive",
           clx        = 1,
           cly        = 1,
           xthreshold = 0.10,
           ythreshold = 0.15)


Final model:
 
   n    CLx       SS  CLy Q.I Q.II Q.III Q.IV  Fisher.p.value
1 45 2209.5 190716.1 1317   0   30     0   15  2.899665e-12



A plot of daily sodium intake vs. daily caloric intake for students, with a Cate–Nelson analysis applied.  Students with a Caloric intake greater than 2210 calories tended to have a sodium intake greater than 1317 mg per day.