Linear models are used for a wide variety of statistical analyses. The basic concept is that a dependent variable can be predicted from a set of independent variables that are related in a linear fashion. This framework leads to models that are very flexible and can include a variety of continuous or categorical independent variables.
All linear models make some assumptions about the underlying population from which the data are sampled. First, the relationships in the sampled population should be consistent with a model composed of a linear combination of the effects. Non-linearity would suggest that the model effects should be modified or that a different kind of model should be used. In addition, each type of linear model usually makes some assumptions about the distribution of the population. For the reported statistics to be valid, it is essential to understand these assumptions for the specific type of model being used, and either to check them or to have a basis for accepting them.
Types of linear models
In general, the type of model to be used is determined by the nature of the dependent variable.
General linear models
Readers may be familiar with linear regression, multiple linear regression, or analysis of variance (ANOVA).
These models can be considered part of a larger category of linear models called the general linear model. The dependent variable is a continuous interval/ratio variable, and there are assumptions about the distribution of the sampled population. These models are usually called parametric models or tests, although technically other types of models that assume properties about the parameters of the model are also parametric in nature.
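As a minimal sketch of this idea, the following fits a simple linear regression and a one-way analysis of variance with the same base R function, lm(). The data sets cars and PlantGrowth ship with R and are chosen here only for illustration; the object names model_reg and model_aov are arbitrary.

```r
# Simple linear regression: continuous dependent variable (dist)
# predicted from a continuous independent variable (speed).
model_reg <- lm(dist ~ speed, data = cars)

# One-way ANOVA expressed as a linear model: continuous dependent
# variable (weight) predicted from a categorical variable (group).
model_aov <- lm(weight ~ group, data = PlantGrowth)

# Estimated coefficients for each model
coef(model_reg)
coef(model_aov)
```

Both calls return the same kind of model object, which is one way to see that regression and ANOVA belong to the same family of models.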
Generalized linear models
Models for other types of dependent variables can be developed in a generalized linear model framework. This approach is similar to the general linear model approach, except that there are different assumptions about the nature of the dependent variable or the distribution of its population.
For example, ordinal dependent variables can be modeled with cumulative link models. Binary (yes/no) dependent variables can be modeled with logistic regression. Dependent variables of discrete counted quantities can be modeled with Poisson regression and similar techniques. Percent and proportion data can be modeled with beta regression.
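Two of these cases can be sketched with the base R function glm(), which takes a family argument describing the assumed distribution. The data sets mtcars and warpbreaks ship with R and are used here only as convenient examples. Cumulative link models and beta regression need add-on packages and are not shown.

```r
# Logistic regression: binary dependent variable (am is 0 or 1)
model_logit <- glm(am ~ wt, data = mtcars, family = binomial)

# Poisson regression: dependent variable is a count (breaks)
model_pois <- glm(breaks ~ wool + tension, data = warpbreaks,
                  family = poisson)

# The family and link function record the distributional assumption
family(model_logit)
family(model_pois)

coef(model_logit)
coef(model_pois)
```

The model formula is written the same way as for lm(); only the family changes.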
Other linear models
Not all linear models are included in the general linear model and generalized linear model categories. For example, common quantile regression is a type of linear model not included in these categories.
For any type of linear model, some method is used to find the values of the model parameters that best fit the data. For simple models like linear regression or a simple analysis of variance, these parameter estimates could be found by hand calculation. Luckily, more complex models can be fit by computer algorithm. General linear models are typically fit by ordinary least squares (OLS), whereas generalized linear models are typically fit by maximum likelihood estimation (MLE). OLS can be viewed as a special case of MLE: the two give the same coefficient estimates when the conditional distribution of the sampled population is assumed to be normal. These approaches to estimation are generally reliable, but the reader should be aware that there are cases where MLE may fail, for example when the fitting algorithm does not converge.
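The relationship between OLS and MLE can be checked numerically. The sketch below computes the OLS estimates for a simple regression directly from the normal equations, then compares them with lm() and with glm() using a Gaussian family, which is fit by an iterative likelihood-based algorithm. The cars data set is used only for illustration, and the object names are arbitrary.

```r
# Design matrix with a column of 1s for the intercept
X <- cbind(1, cars$speed)
y <- cars$dist

# OLS "by hand": solve the normal equations (X'X) b = X'y
beta_ols <- solve(crossprod(X), crossprod(X, y))

# The same model fit by lm() (least squares) and by
# glm() with a normal (gaussian) conditional distribution
fit_lm  <- lm(dist ~ speed, data = cars)
fit_glm <- glm(dist ~ speed, data = cars, family = gaussian)

# All three sets of coefficient estimates should agree
as.numeric(beta_ols)
unname(coef(fit_lm))
unname(coef(fit_glm))
```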
Formulae for specifying models in R
Most packages for specifying types of models in R use a similar grammar in the model formula.
y ~ x1 + x2 + x1:x2
Specifies y as the dependent variable, and x1, x2, and the interaction of x1 and x2 as the independent variables.
y ~ x1 | g
Specifies g as a stratification variable or as a conditional variable, that is x1 given g.
The symbol 1 specifies an intercept for the model, so that
y ~ x1 - 1
indicates that the intercept should not be included in the model, and
y ~ 1
indicates that only the intercept is to be included in the right hand side of this model.
The King article in the “References” section has some more detail on model specification syntax in R.
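One way to see what a formula asks for is to look at the columns of the design matrix that R builds from it. The sketch below uses the base R function model.matrix() with a small made-up data frame; the values in Data are hypothetical and serve only to make the example run.

```r
# Hypothetical toy data, for illustration only
Data <- data.frame(y  = c(1.2, 2.3, 2.9, 4.1, 5.2, 5.8),
                   x1 = c(1, 2, 3, 4, 5, 6),
                   x2 = c(0, 1, 0, 1, 0, 1))

# Main effects plus interaction; the intercept is included by default
cols_full <- colnames(model.matrix(y ~ x1 + x2 + x1:x2, data = Data))

# Intercept removed with "- 1"
cols_noint <- colnames(model.matrix(y ~ x1 - 1, data = Data))

# Intercept only
cols_int <- colnames(model.matrix(y ~ 1, data = Data))

cols_full    # "(Intercept)" "x1" "x2" "x1:x2"
cols_noint   # "x1"
cols_int     # "(Intercept)"
```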
Data in data frames
Usually the variables in the model will be taken from a data frame. For example
y ~ x
will look for y and x in the global environment, whereas
y ~ x, data = Data
will look for y and x in the data frame called Data.
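A short sketch of the two cases, using lm() as the fitting function. The data frame Data and its values are made up for illustration.

```r
# Hypothetical data frame, for illustration only
Data <- data.frame(x = c(1, 2, 3, 4, 5),
                   y = c(2.1, 3.9, 6.2, 7.8, 10.1))

# Variables are looked up in the data frame
model_df <- lm(y ~ x, data = Data)

# Without data =, the variables must exist in the workspace
x <- Data$x
y <- Data$y
model_ws <- lm(y ~ x)

# Both fits use the same values, so the estimates match
coef(model_df)
coef(model_ws)
```

Supplying data = is usually the safer habit, since it avoids accidentally picking up an unrelated object with the same name from the workspace.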
The syntax for random effects varies somewhat across packages, but both
y ~ x, random = ~1|Subject
and
y ~ x + (1|Subject)
specify that y is the dependent variable, x is an independent variable, and Subject is an independent variable treated as a random variable, specifically with an intercept fit for each level of Subject.
More on specifying random effects is discussed in the chapter Repeated Measures ANOVA.
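As a hedged illustration of the first style, the sketch below uses lme() from the nlme package, assuming that package is installed (it is normally distributed with R). The Orthodont data set comes with nlme and is used only as an example. The second style is the one used by the lme4 package and is shown only as a comment, since lme4 may not be installed.

```r
library(nlme)

# Random intercept for each Subject, nlme-style syntax
model_lme <- lme(distance ~ age, random = ~ 1 | Subject,
                 data = Orthodont)

# Fixed-effect estimates (intercept and slope for age)
fixef(model_lme)

# Estimated random intercepts, one row per level of Subject
head(ranef(model_lme))

# The equivalent model in lme4-style syntax would be written as:
#   lme4::lmer(distance ~ age + (1 | Subject), data = Orthodont)
```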
Extracting model information from R
Packages vary in the methods they provide to extract information about a model, but some methods are relatively common across several packages. For other types of model objects, similar information may be available through different functions.
Assuming that model has been defined as a model object by an appropriate function,
summary(model)
produces a summary of the model with estimates of the coefficients, and sometimes other useful information like a p-value for the model, or an R-squared or pseudo R-squared for the model.
anova(model)
for some models, will produce an analysis of variance table, or an analysis of deviance table.
plot(model)
will produce plots of the model, usually diagnostic plots.
anova(model1, model2)
will compare two models by an analysis of variance or other test.
predict(model)
will report predicted values in the dependent variable from the model for each observation that went into the model.
residuals(model)
will report residual values from the model.
str(model)
reports the structure of the information stored in the model object.
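The base R functions commonly used for these tasks are summary(), anova(), plot(), predict(), residuals(), and str(). The sketch below applies them to a model fit with lm() on the built-in cars data; other model types may support only some of them, or may need functions from their own package. The object names model and model1 are arbitrary.

```r
# An intercept-only model and a model with one predictor
model1 <- lm(dist ~ 1, data = cars)
model  <- lm(dist ~ speed, data = cars)

summary(model)             # coefficients, p-value, R-squared
anova(model)               # analysis of variance table
anova(model1, model)       # compare the two nested models

head(predict(model))       # predicted values for the fitted data
head(residuals(model))     # residuals (observed minus predicted)

str(model, max.level = 1)  # top-level structure of the model object

# plot(model) would draw the standard diagnostic plots
```

For an lm object, the predicted values plus the residuals reproduce the observed values of the dependent variable.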
References
King, W.B. Model Formulae. ww2.coastal.edu/kingw/statistics/R-tutorials/formulae.html.