Linear models are used for a wide variety of statistical analyses. The basic concept is that a dependent variable can be predicted from a set of independent variables that are related in a linear fashion. This framework leads to models that are very flexible and can include a variety of continuous or categorical independent variables.

All linear models make some assumptions about the underlying population from which the data are sampled. For one thing, the sampled population should accord with the fact that the model is composed of a linear combination of the effects. Non-linearity would suggest that the model effects should be modified or that a different kind of model should be used. In addition, each type of linear model usually makes some assumptions about the distribution of the population. For the reported statistics to be valid, it is essential to understand and either check these assumptions or have a basis for accepting these assumptions for the specific type of model being used.

### Types of linear models

In general, the type of model to be used is determined by the nature of the dependent variable.

#### General linear models

Readers may be familiar with linear regression, multiple linear regression, or analysis of variance (ANOVA).

These models can be considered part of a larger category of
linear models called *general linear models*. The dependent variable is
continuous (interval/ratio), and there are assumptions about the distribution of
the sampled population. These models are usually called *parametric*
models or tests, although technically other types of models that make
assumptions about the parameters of the model are also parametric in nature.

#### Generalized linear models

Models for other types of dependent variables can be
developed in a *generalized linear model* framework. This approach is similar
to the general linear model approach, except that there are different assumptions
about the nature of the dependent variable or the distribution of its
population.

For example, ordinal dependent variables can be modeled with cumulative link models. Binary (yes/no) dependent variables can be modeled with logistic regression. Dependent variables of discrete counted quantities can be modeled with Poisson regression and similar techniques. Percent and proportion data can be modeled with beta regression.
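
As a minimal sketch of two of these cases, the base R function `glm()` can fit a logistic regression and a Poisson regression. The data below are simulated purely for illustration, and the object names are hypothetical. Cumulative link models and beta regression are not part of base R; add-on packages such as *ordinal* and *betareg* are one option for those.

```r
# Simulated data for illustration only
set.seed(42)
n <- 200
x <- runif(n, 0, 10)

# Binary (yes/no) dependent variable: logistic regression
pass <- rbinom(n, size = 1, prob = plogis(-2 + 0.5 * x))
model.logistic <- glm(pass ~ x, family = binomial(link = "logit"))

# Counted dependent variable: Poisson regression
count <- rpois(n, lambda = exp(0.2 + 0.15 * x))
model.poisson <- glm(count ~ x, family = poisson(link = "log"))

coef(model.logistic)
coef(model.poisson)
```

Only the `family` argument changes between the two fits; the model formula keeps the same form.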

#### Other linear models

Not all linear models are included in the *general linear
model* and *generalized linear model* categories. For example,
quantile regression is a common type of linear model not included in these categories.

### Fitting models

For any type of linear model, some method is used to find the parameter values that best fit the data. For simple models like linear regression or a simple analysis of variance, these parameter estimates could be found by hand calculation. Luckily, more complex models can be fit by computer algorithm. General linear models are typically fit by ordinary least squares (OLS), whereas generalized linear models are typically fit by maximum likelihood estimation (MLE). OLS can be viewed as a specific case of MLE: the two give the same coefficient estimates when the conditional distribution of the sampled population is assumed to be normal. These approaches to estimation are generally reliable, but the reader should be aware that there are cases where MLE may fail, for example when the fitting algorithm does not converge.
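
The relationship between OLS and MLE can be checked numerically. In this sketch, which uses simulated data and hypothetical object names, `lm()` fits the model by least squares and `glm()` with a Gaussian family fits the same model by maximum likelihood (through iteratively reweighted least squares). With an identity link, the two sets of coefficient estimates should agree to within numerical tolerance.

```r
# Simulated data for illustration only
set.seed(1)
x <- runif(50)
y <- 2 + 3 * x + rnorm(50, sd = 0.5)

# Ordinary least squares fit
model.ols <- lm(y ~ x)

# Maximum likelihood fit assuming a normal conditional distribution
model.mle <- glm(y ~ x, family = gaussian(link = "identity"))

# Compare the coefficient estimates from the two fits
all.equal(coef(model.ols), coef(model.mle))
```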

### Formulae for specifying models in R

Most R packages for fitting models use a similar grammar in the model formula.

The formula

y ~ x1 + x2 + x1:x2

specifies *y* as the dependent variable, and *x1*, *x2*,
and the interaction of *x1* and *x2* as the independent variables.

The formula

y ~ x1 | g

specifies *g* as a stratification variable or as a
conditioning variable, that is, *x1* given *g*.

The symbol *1* specifies an intercept for the model, so
that

y ~ x1 - 1

indicates that the intercept should not be included in the model, and

y ~ 1

indicates that only the intercept is to be included in the right-hand side of the model.

The King article in the “References” section has some more detail on model specification syntax in R.
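
One way to see what a formula asks for is to inspect the columns of the design matrix that R builds from it. The sketch below uses `model.matrix()` from base R on a small, made-up data frame; the data values are hypothetical and serve only to show which terms each formula produces.

```r
# Hypothetical data for illustration
Data <- data.frame(y  = c(3.1, 4.2, 5.9, 7.1, 8.8, 10.2),
                   x1 = c(1, 2, 3, 4, 5, 6),
                   x2 = c(2, 1, 4, 3, 6, 5))

# Main effects plus their interaction (intercept included by default)
colnames(model.matrix(y ~ x1 + x2 + x1:x2, data = Data))

# Intercept removed
colnames(model.matrix(y ~ x1 - 1, data = Data))

# Intercept only
colnames(model.matrix(y ~ 1, data = Data))
```

The first call lists an `(Intercept)` column along with `x1`, `x2`, and `x1:x2`; the second lists only `x1`; the third lists only `(Intercept)`.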

#### Data in data frames

Usually the variables in the model will be taken from a data frame. For example, when passed to a model-fitting function,

y ~ x

will look for *y* and *x* in the global
environment, whereas

y ~ x, data = Data

will look for *y* and *x* in the data frame called
*Data*.
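
As a brief illustration, the following sketch passes a formula and a `data` argument to `lm()`. The data frame *Data* and its values are made up for this example.

```r
# Hypothetical data frame for illustration
Data <- data.frame(x = c(1, 2, 3, 4, 5),
                   y = c(2.1, 3.9, 6.2, 7.8, 10.1))

# x and y are looked up in Data rather than in the global environment
model <- lm(y ~ x, data = Data)

coef(model)
```

Keeping variables in a data frame avoids accidentally picking up an unrelated object with the same name from the workspace.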

#### Random effects

The syntax for random effects varies somewhat across packages, but

y ~ x, random = ~1|Subject

or

y ~ x + (1|Subject)

specifies that *y* is the dependent variable, *x*
is an independent variable, and *Subject* is an independent variable treated
as a random effect, specifically with an intercept fit for each level of *Subject*.

More on specifying random effects is discussed in the
chapter *Repeated Measures ANOVA*.
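
The two styles of syntax correspond, for example, to the *nlme* and *lme4* packages. The sketch below is one possible illustration rather than a recommended analysis: it assumes *nlme* is installed (it is normally distributed with R), uses the `Orthodont` example data that ships with *nlme*, and runs the *lme4* version only if that package is available.

```r
library(nlme)   # provides lme() and the Orthodont example data

# nlme style: random intercept given in a separate 'random' argument
model.lme <- lme(distance ~ age, random = ~ 1 | Subject, data = Orthodont)
fixef(model.lme)

# lme4 style: random intercept written inside the formula.
# This part runs only if lme4 is installed.
if (requireNamespace("lme4", quietly = TRUE)) {
  model.lmer <- lme4::lmer(distance ~ age + (1 | Subject), data = Orthodont)
  print(lme4::fixef(model.lmer))
}
```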

### Extracting model information from R

Packages vary in the methods used to extract information about a model, but some methods are relatively common across several packages. For other types of model objects, similar information may be available through different functions.

Assuming that *model* has been defined as a model
object by an appropriate function,

summary(model)

produces a summary of the model with estimates of the
coefficients, and sometimes other useful information like a *p*-value for
the model, or an *R-squared* or pseudo *R-squared* for the model.

library(car)

Anova(model)

for some models, will produce an analysis of variance table, or an analysis of deviance table.

plot(model)

will produce plots of the model, usually diagnostic plots.

anova(model1, model2)

will compare two models by an analysis of variance or other test.

predict(model)

will report predicted values of the dependent variable from the model for each observation that went into the model.

residuals(model)

will report residual values from the model.

str(model)

reports the structure of the information stored in the model object.
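
To tie these functions together, here is a short sketch using `lm()` and the `cars` data set that is built into R. The object names are arbitrary, and the `Anova()` call is run only if the *car* package is installed. `plot(model2)` is left out so that the example runs without opening a graphics device.

```r
# Two nested models fit to the built-in 'cars' data
model1 <- lm(dist ~ 1, data = cars)       # intercept only
model2 <- lm(dist ~ speed, data = cars)   # adds speed as a predictor

summary(model2)          # coefficient estimates, R-squared, overall p-value
anova(model1, model2)    # compare the two nested models

predicted <- predict(model2)     # one predicted value per observation
resid     <- residuals(model2)   # one residual per observation

# Anova() comes from the car package; run only if it is installed
if (requireNamespace("car", quietly = TRUE)) {
  print(car::Anova(model2))
}

# str() output for a model object is long; max.level limits the depth shown
str(model2, max.level = 1)
```

For each observation, the predicted value plus the residual equals the observed value of the dependent variable.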

### References

King, W.B. Model Formulae. ww2.coastal.edu/kingw/statistics/R-tutorials/formulae.html.