R Handbook: Types of Variables

Organizing data: observations and variables

In general, collected raw data is organized according to observations and variables. Each variable represents a single measurement or characteristic recorded for each observation. In the Statistics Learning Center video in the Required Readings below, Dr. Nic gives an example of a survey where each observation is a separate person, and the variables are age, sex, and chocolate preference for each person.

In education evaluation, observations will commonly be a student or an instructor, but could also be any other experimental unit, such as a single farmer’s field.

In reality, a single observation may be more specific: for example, the rating of one instructor by one student on one date. Imagine the following data, where each instructor was rated on two different dates by each of four students.

Each row represents an observation. So each observation contains the rating by a single student on a single date for a single instructor. The variables are Instructor, Date, Student, and Rating.

Instructor     Date        Student  Rating
Bob Belcher    2015-01-01  a         4
Bob Belcher    2015-01-01  b         5
Bob Belcher    2015-01-01  c         4
Bob Belcher    2015-01-01  d         6
Bob Belcher    2015-02-05  a         6
Bob Belcher    2015-02-05  b         6
Bob Belcher    2015-02-05  c        10
Bob Belcher    2015-02-05  d         6
Linda Belcher  2015-01-01  a         8
Linda Belcher  2015-01-01  b         6
Linda Belcher  2015-01-01  c         8
Linda Belcher  2015-01-01  d         8
Linda Belcher  2015-02-05  a         8
Linda Belcher  2015-02-05  b         7
Linda Belcher  2015-02-05  c        10
Linda Belcher  2015-02-05  d         9
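
As a minimal sketch, the table above could be entered in R with read.table, the function used later in this chapter. The object name Ratings is hypothetical, and the instructor names are written with a period instead of a space so that each name is read as a single field.

```r
# Hypothetical object name; values are copied from the table above.
# Periods replace spaces in names so read.table sees one field per name.
Ratings = read.table(header=TRUE, stringsAsFactors=TRUE, text="
Instructor     Date        Student  Rating
Bob.Belcher    2015-01-01  a         4
Bob.Belcher    2015-01-01  b         5
Bob.Belcher    2015-01-01  c         4
Bob.Belcher    2015-01-01  d         6
Bob.Belcher    2015-02-05  a         6
Bob.Belcher    2015-02-05  b         6
Bob.Belcher    2015-02-05  c        10
Bob.Belcher    2015-02-05  d         6
Linda.Belcher  2015-01-01  a         8
Linda.Belcher  2015-01-01  b         6
Linda.Belcher  2015-01-01  c         8
Linda.Belcher  2015-01-01  d         8
Linda.Belcher  2015-02-05  a         8
Linda.Belcher  2015-02-05  b         7
Linda.Belcher  2015-02-05  c        10
Linda.Belcher  2015-02-05  d         9
")

nrow(Ratings)    # number of observations (rows): 16
ncol(Ratings)    # number of variables (columns): 4
str(Ratings)     # class of each variable
```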

Long-format and wide-format data

The example above has ratings for each instructor on each of two dates and by each of four students. Because each observation has just one rating value, this data is in long format.

In general, it is best to keep data in long format for summary and analyses.

However, to conduct certain analyses in R, or to produce certain plots, data will need to be in wide format. The following is an example of the same data, translated into wide format.

                        Rating      Rating
Instructor     Student  2015-01-01  2015-02-05
Bob Belcher    a         4           6
Bob Belcher    b         5           6
Bob Belcher    c         4          10
Bob Belcher    d         6           6
Linda Belcher  a         8           8
Linda Belcher  b         6           7
Linda Belcher  c         8          10
Linda Belcher  d         8           9
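
One way to translate long-format data into wide format is the reshape function in base R. The following is a sketch only, not necessarily the approach used elsewhere in this book. The object names Long and Wide are hypothetical, and periods replace the spaces in the instructor names so that read.table reads each name as one field.

```r
# Hypothetical object names; values are copied from the long-format table.
Long = read.table(header=TRUE, stringsAsFactors=FALSE, text="
Instructor     Date        Student  Rating
Bob.Belcher    2015-01-01  a         4
Bob.Belcher    2015-01-01  b         5
Bob.Belcher    2015-01-01  c         4
Bob.Belcher    2015-01-01  d         6
Bob.Belcher    2015-02-05  a         6
Bob.Belcher    2015-02-05  b         6
Bob.Belcher    2015-02-05  c        10
Bob.Belcher    2015-02-05  d         6
Linda.Belcher  2015-01-01  a         8
Linda.Belcher  2015-01-01  b         6
Linda.Belcher  2015-01-01  c         8
Linda.Belcher  2015-01-01  d         8
Linda.Belcher  2015-02-05  a         8
Linda.Belcher  2015-02-05  b         7
Linda.Belcher  2015-02-05  c        10
Linda.Belcher  2015-02-05  d         9
")

# One row per Instructor and Student; one Rating column per Date
Wide = reshape(Long,
               idvar     = c("Instructor", "Student"),
               timevar   = "Date",
               v.names   = "Rating",
               direction = "wide")

Wide
```

The new rating columns are named by joining the variable name and the date, for example Rating.2015-01-01.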

Types of variables

The most common variables used in data analysis can be classified as one of three types of variables: nominal, ordinal, and interval/ratio.

Understanding the differences in these types of variables is critical, since the variable type will determine which statistical analysis will be valid for that data. In addition, the way we summarize data with statistics and plots will be determined by the variable type.

Nominal data

Nominal variables are data whose levels are labels or descriptions, and which cannot be ordered. Examples of nominal variables are gender, school, and yes/no questions. They are also called “nominal categorical” or “qualitative” variables, and the categories of a variable are called “levels”, “classes”, or “groups”.

The levels of nominal variables cannot be ordered. For the variable gender, it makes no sense to try to put the levels “female”, “male”, and “other” in any numerical order. If levels are numbered for convenience, the numbers are arbitrary, and the variable can’t be treated as a numeric variable.

Ordinal data

Ordinal variables can be ordered, or ranked in logical order, but the intervals between levels of the variables are not necessarily known. Subjective measurements are often ordinal variables. One example would be having people rank four items by preference in order from one to four. A different example would be having people assess several items based on a Likert item: “On a scale of one to five, do you agree or disagree with this statement?” A third example is level of education for adults, considering for example “less than high school”, “high school”, “associate’s degree”, etc.

Critically, in each case we can order the responses: my favorite salad dressing is better than my second favorite, which is better than my third favorite. But we cannot know if the interval between the levels is equal. For example, the distance between your favorite salad dressing and your second favorite salad dressing may be small, whereas there may be a large gap between your second and third choices.

We can logically assign numbers to levels of an ordinal variable, and can treat them in order, but shouldn’t treat them as numeric: “strongly agree” and “neutral” may not average out to an “agree.”

For the purposes of this book, we will consider such Likert item data to be ordinal data under most circumstances.

Ordinal data is sometimes called “ordered categorical”.
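
As an illustration, the following sketch codes some made-up Likert item responses as an ordered factor. The responses and the object names are hypothetical. Order comparisons and counts make sense for such data, and a median level can be found from the positions of the levels. With an even number of responses, the median position may fall between two levels.

```r
# Hypothetical responses from five people
Responses = c("agree", "neutral", "strongly agree", "agree", "disagree")

Likert = factor(Responses,
                ordered = TRUE,
                levels  = c("strongly disagree", "disagree", "neutral",
                            "agree", "strongly agree"))

table(Likert)          # counts for each level, shown in level order

Likert > "neutral"     # order comparisons are meaningful for ordered factors

# Median position among the ordered levels (odd number of responses here)
MedianCode = median(as.integer(Likert))

levels(Likert)[MedianCode]     # the median level, "agree"
```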

Interval/ratio data

Interval/ratio variables are measured or counted values: age, height, weight, number of students. The interval between numbers is known to be equal: the interval between one kilogram and two kilograms is the same as between three kilograms and four kilograms.

Interval/ratio data are also called “quantitative” data.

Discrete and continuous variables

A further division of interval/ratio data is between discrete and continuous variables. Discrete variables have values that are necessarily whole numbers or other discrete values, such as counts of items. Continuous variables can take on any value within an interval, and so can be expressed as decimals. They are often measured quantities. For example, in theory a weight could be measured as 1 kg, 1.01 kg, or 1.009 kg, and so on. Age could also be considered a continuous variable, though we often treat it as a discrete variable by rounding it down to the most recent birthday.
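
A small sketch of the age example, using made-up ages and hypothetical object names: age in decimal years is continuous, and rounding down to the most recent birthday gives a discrete variable.

```r
# Hypothetical ages in years, treated as a continuous variable
Age = c(10.25, 11.9, 9.5)

class(Age)                           # "numeric"

# Round down to the most recent birthday and store as whole numbers
Age.years = as.integer(floor(Age))

Age.years                            # 10 11 9

class(Age.years)                     # "integer"
```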

Optional technical note

There is a technical difference between interval and ratio data. For interval data, the interval between measurements is the same, but the ratio between measurements is not meaningful. A common case of this is temperature measured in degrees Fahrenheit. The difference between 5 °F and 10 °F is the same as that between 10 °F and 15 °F. But 10 °F is not twice as warm as 5 °F. This is because the zero point of the Fahrenheit scale is arbitrary: 0 °F does not equal 0 °C.

Measurements where there is a natural zero, such as length or height, or where a zero can be honestly defined, such as time since an event, are considered ratio data.

For the most part, ratio and interval data are considered together. Just be careful not to make senseless statements with interval data, such as saying, “The mean temperature in Greenhouse 1 was twice the mean temperature of Greenhouse 2.”
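
The temperature example can be checked with a little arithmetic in R. This is only a sketch, and the object names are hypothetical. The intervals stay equal after converting from Fahrenheit to Celsius, but the ratio between two temperatures changes with the scale, which is why such a ratio is not meaningful.

```r
Temp.F = c(5, 10, 15)

# Standard conversion from degrees Fahrenheit to degrees Celsius
Temp.C = (Temp.F - 32) * 5 / 9

diff(Temp.F)             # equal intervals in Fahrenheit: 5 5

diff(Temp.C)             # equal intervals in Celsius, about 2.78 each

Temp.F[2] / Temp.F[1]    # 2 on the Fahrenheit scale

Temp.C[2] / Temp.C[1]    # about 0.81 on the Celsius scale
```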

Levels of measurement

In general it is advantageous to treat variables as the highest level of measurement for which they qualify. That is, we could treat education level as a nominal variable, but usually we will want to treat it as an ordinal variable. This is because treating it as an ordinal variable retains more of the information carried in the data. If we were to reduce it to a nominal variable, we would lose the order of the levels of the variable. By using a higher level of measurement, we will have more options in the way we analyze, summarize, and present data.

This being said, there may be cases when it is advantageous to treat ordinal or count data as categorical. One case is if there are few levels of the variable, or if it makes sense to condense the variable into a couple of broad categories. Another example of when we choose a lower level of measurement is when we use nonparametric statistical analyses which treat interval/ratio data as ordinal, or ranked, data.
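
Both cases can be sketched in R with made-up count data. The object names, the cutoff of 10, and the labels “low” and “high” are assumptions chosen only for illustration. The rank function replaces values with their ranks, and the cut function condenses a numeric variable into broad categories.

```r
# Hypothetical count data
Count = c(3, 12, 7, 25, 7)

# Treat interval/ratio data as ranked data; tied values share an average rank
rank(Count)              # 1.0 4.0 2.5 5.0 2.5

# Condense into two broad categories: 10 or fewer, and more than 10
Group = cut(Count,
            breaks = c(0, 10, Inf),
            labels = c("low", "high"))

table(Group)             # low: 3, high: 2
```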

Types of Variables in R

R does not use the terms nominal, ordinal, and interval/ratio for types of variables.

In R, nominal variables can be coded as variables with factor or character classes.

Colors = c("Red", "Green", "Blue")

class(Colors)

[1] "character"

Colors.f = factor(c("Red", "Green", "Blue"))

class(Colors.f)

[1] "factor"

Interval/ratio data can be coded as variables with numeric or integer classes. An L suffix is used with values to tell R to store the data as an integer class.

BugCount = c(1, 2, 3, 4, 5)

class(BugCount)

[1] "numeric"

BugCount.int = c(1L, 2L, 3L, 4L, 5L)

class(BugCount.int)

[1] "integer"

We can code ordinal data as either numeric or factor variables, depending on how we will be summarizing, plotting, and analyzing it.

Note that with the stringsAsFactors=TRUE option, read.table will read text input as a factor variable. Also, read.table will determine if a numeric variable should have an integer or numeric class.

Dragons = read.table(header=TRUE, stringsAsFactors=TRUE, text="

Tribe       Length.m SizeRank
IceWings    6.4       1
MudWings    6.1       2
SeaWings    5.8       3
SkyWings    5.5       4
NightWings 5.2       5
RainWings   4.9       6
SandWings   4.6       7
")

str(Dragons)

'data.frame': 7 obs. of 3 variables:
$ Tribe : Factor w/ 7 levels "IceWings","MudWings",..: 1 2 6 7 3 4 5
$ Length.m: num 6.4 6.1 5.8 5.5 5.2 4.9 4.6
$ SizeRank: int 1 2 3 4 5 6 7

We can convert the variable SizeRank into an ordered factor variable. This will help us with some types of summary and analysis, and is helpful in determining the order that the levels of a factor variable will be plotted.

Dragons$SizeRank.f = factor(Dragons$SizeRank,
                            ordered = TRUE,
                            levels  = c("1", "2", "3", "4", "5", "6", "7"))

str(Dragons)

'data.frame': 7 obs. of 4 variables:
$ Tribe : Factor w/ 7 levels "IceWings","MudWings",..: 1 2 6 7 3 4 5
$ Length.m : num 6.4 6.1 5.8 5.5 5.2 4.9 4.6
$ SizeRank : int 1 2 3 4 5 6 7
$ SizeRank.f: Ord.factor w/ 7 levels "1"<"2"<"3"<"4"<..: 1 2 3 4 5 6 7

sapply(Dragons, class)

$Tribe
[1] "factor"

$Length.m
[1] "numeric"

$SizeRank
[1] "integer"

$SizeRank.f
[1] "ordered" "factor"
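
If a factor with numeral levels needs to be used as a numeric variable again, some care is needed. The following sketch uses a made-up vector and hypothetical object names. Calling as.numeric directly on a factor returns the internal position of each level, not the original values, so the factor is first converted to character.

```r
# Hypothetical scores stored as an ordered factor
Score = c(10, 30, 20, 30)

Score.f = factor(Score, ordered = TRUE)

levels(Score.f)                       # "10" "20" "30"

as.numeric(Score.f)                   # level positions: 1 3 2 3

as.numeric(as.character(Score.f))     # original values: 10 30 20 30
```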

References

Rouxzee. 2014. Wings of Fire: How Big are the Dragons? DeviantArt.com. (Since deactivated.)

Required readings

[Video] “Types of Data: Nominal, Ordinal, Interval/Ratio” from Statistics Learning Center (Dr. Nic). 2011. www.youtube.com/watch?v=hZxnzfnt5v8.

Optional readings

“Types of biological variables” in McDonald, J.H. 2014. Handbook of Biological Statistics. www.biostathandbook.com/variabletypes.html.

“Frequency, frequency tables, and levels of measurement”, Chapter 1.3 in Openstax College. 2013. Introductory Statistics. Rice University. openstax.org/textbooks/introductory-statistics.

“Data basics”, Chapter 1.2 in Diez, D.M., C.D. Barr, and M. Çetinkaya-Rundel. 2012. OpenIntro Statistics, 2nd ed. www.openintro.org/.

Exercises B

1. For the following variables and levels, identify the variable as nominal, ordinal, or interval/ratio.

a. Political affiliation (very liberal, liberal, independent, conservative, very conservative)

b. Political affiliation (Democrat, Republican, Green, Libertarian)

c. Student age

d. Six ranked school subjects, according to which is the respondent’s favorite (Science, History, Math, Physical Education, English Literature)

e. How often do you complete your homework (never, sometimes, often, always)

f. Gender identity (1-female, 2-male, 3-other)

g. Level of education (1-elementary school, 2-junior high school, 3-high school, 4-college)

h. Dairy cow weight

i. Number of meals and snacks eaten in a day

j. (Yes, No)

k. (Yes, No, Not applicable)

l. Favorite salad dressing (1st, 2nd, 3rd, 4th)

m. Ranked salad dressings, according to which is the respondent’s favorite (Italian, French, Caesar, etc.)

2. Is each of the following terms associated with nominal, ordinal, or interval/ratio data?

a. Likert item responses

b. Categorical (two correct answers)

c. Continuous

d. Qualitative

e. Count

3. For the following table, identify:

a. Number of observations

b. Number of variables

c. Type of each variable (nominal, ordinal, or interval/ratio)

Student  Grade  Age  Gender  Height  Calories  Attitude  Class
                                     eaten               rank
A        5      10   M       137     2000      5         4
B        5      11   M       140     1500      4         2
C        4       9   F       120     1200      3         5
D        4      10   F       140     1400      4         1
E        6      12   Other   147     1800      5         6
F        5      10   M       135     1600      4         8
G        4       9   M       130     1200      4         3
H        4      10   F       140     1800      3         7

4. For the following table, identify:

a. Number of observations

b. Number of variables

c. Type of each variable (nominal, ordinal, or interval/ratio)

Farm  Town        Acreage  Product   Sustainability      Sustainability  Preserved
                                                         code
A     Skara.Brae  21.1     Turnip    Very.unsustainable  1               Yes
B     Arboria      7.0     Parsnip   Sustainable         4               Yes
C     Lucencia    32.5     Rutabaga  Unsustainable       2               No
D     Tenebrosia  21.0     Daikon    Very.sustainable    5               No
E     Tarmitia     6.3     Yucca     Sustainable         4               Unknown
F     Malefia     18.1     Kohlrabi  Neutral             3               Yes

Summary and Analysis of Extension Program Evaluation in R
