In general, collected raw data is organized according to observations and variables. Each variable represents a single measurement or characteristic for each observation. In the Statistics Learning Center video in the Required Readings below, Dr. Nic gives an example of a survey where each observation is a separate person, and the variables are age, sex, and chocolate preference for each person.
In education evaluation, an observation will commonly be a student or an instructor, but could also be any other experimental unit, such as a single farmer’s field.
In practice, a single observation may be more specific: for example, the rating for one instructor by one student on one date. Here is part of the example from the Friedman Test chapter with some additional hypothetical data.
Each row represents an observation. So each observation contains the rating by a single student on a single date for an instructor. The variables are Instructor, Date, Student, and Rating.
Instructor       Date          Student   Rating
Bob Belcher      2015-01-01    a         4
Bob Belcher      2015-01-01    b         5
Bob Belcher      2015-01-01    c         4
Bob Belcher      2015-01-01    d         6
Bob Belcher      2015-02-05    e         6
Bob Belcher      2015-02-05    f         6
Bob Belcher      2015-02-05    g         10
Bob Belcher      2015-02-05    h         6
Linda Belcher    2015-01-01    a         8
Linda Belcher    2015-01-01    b         6
Linda Belcher    2015-01-01    c         8
Linda Belcher    2015-01-01    d         8
Linda Belcher    2015-02-05    e         8
Linda Belcher    2015-02-05    f         7
Linda Belcher    2015-02-05    g         10
Linda Belcher    2015-02-05    h         9
The example above has ratings for each of two instructors on each of two dates, with four students rating on each date (eight students in all). Because each row holds just one rating value, this data is in long format.
In general, it is best to keep data in long format for summary and analyses.
However, to conduct certain analyses in R, or to produce certain plots, data will need to be in wide format. The following is an example of the same data, translated into wide format, with a focus on the ratings by date, and in this case ignoring student.
                 -------- Rating --------
Instructor       2015-01-01    2015-02-05
Bob Belcher      4             6
Bob Belcher      5             6
Bob Belcher      4             10
Bob Belcher      6             6
Linda Belcher    8             8
Linda Belcher    6             7
Linda Belcher    8             10
Linda Belcher    8             9
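The translation from long to wide format can be sketched in a few lines of code. The following is a minimal illustration in Python rather than R, using only the standard library; the function name long_to_wide is hypothetical, and the sketch assumes that ratings are paired across dates simply by their order within each instructor, which is how the wide table above ignores student.

```python
from collections import defaultdict

# Long format: one row per observation (Instructor, Date, Student, Rating).
long_rows = [
    ("Bob Belcher", "2015-01-01", "a", 4),
    ("Bob Belcher", "2015-01-01", "b", 5),
    ("Bob Belcher", "2015-01-01", "c", 4),
    ("Bob Belcher", "2015-01-01", "d", 6),
    ("Bob Belcher", "2015-02-05", "e", 6),
    ("Bob Belcher", "2015-02-05", "f", 6),
    ("Bob Belcher", "2015-02-05", "g", 10),
    ("Bob Belcher", "2015-02-05", "h", 6),
    ("Linda Belcher", "2015-01-01", "a", 8),
    ("Linda Belcher", "2015-01-01", "b", 6),
    ("Linda Belcher", "2015-01-01", "c", 8),
    ("Linda Belcher", "2015-01-01", "d", 8),
    ("Linda Belcher", "2015-02-05", "e", 8),
    ("Linda Belcher", "2015-02-05", "f", 7),
    ("Linda Belcher", "2015-02-05", "g", 10),
    ("Linda Belcher", "2015-02-05", "h", 9),
]


def long_to_wide(rows):
    """Hypothetical helper: one output row per position, one column per date."""
    by_instructor = defaultdict(lambda: defaultdict(list))
    for instructor, date, _student, rating in rows:  # student is ignored
        by_instructor[instructor][date].append(rating)

    wide = []
    for instructor, by_date in by_instructor.items():
        dates = sorted(by_date)  # ISO dates sort chronologically as text
        # zip pairs the first rating on each date, then the second, and so on
        for ratings in zip(*(by_date[d] for d in dates)):
            wide.append((instructor, *ratings))
    return wide


wide_rows = long_to_wide(long_rows)
for row in wide_rows:
    print(row)
```

Each tuple in wide_rows holds an instructor followed by one rating per date, so one wide row now carries two rating values where a long row carried one.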
The most common variables used in data analysis can be classified as one of three types of variables: nominal, ordinal, and interval/ratio.
Understanding the differences in these types of variables is critical, since the variable type will determine which statistical analysis will be valid for that data. In addition, the way we summarize data with statistics and plots will be determined by the variable type.
Nominal variables are data whose levels are labels or descriptions, and which cannot be ordered. Examples of nominal variables are sex, school, and yes/no questions. They are also called “categorical” or “qualitative” variables, and the levels of a variable are sometimes called “classes” or “groups”.
The levels of categorical variables cannot be ordered. For the variable sex, it makes no sense to try to put the levels “female”, “male”, and “other” in any numerical order. If levels are numbered for convenience, the numbers are arbitrary, and the variable can’t be treated as a numeric variable.
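As a small illustration of what can sensibly be done with a nominal variable, the sketch below (in Python, with made-up responses) counts how often each level occurs and finds the most common level. The numeric codes in the second half are an arbitrary assumption, chosen only to show that arithmetic on such codes has no meaning.

```python
from collections import Counter

# Hypothetical responses for the nominal variable "sex"
responses = ["female", "male", "female", "other", "male", "female"]

# Counting levels is meaningful for nominal data
counts = Counter(responses)
mode_level, mode_count = counts.most_common(1)[0]
print(counts)
print("Most common level:", mode_level, "with", mode_count, "responses")

# Arbitrary numeric codes (an assumption for illustration only).
# The "mean" of these codes changes if the codes are reassigned,
# so it says nothing about the data.
codes_a = {"female": 1, "male": 2, "other": 3}
codes_b = {"female": 3, "male": 1, "other": 2}
mean_a = sum(codes_a[r] for r in responses) / len(responses)
mean_b = sum(codes_b[r] for r in responses) / len(responses)
print("Mean under coding A:", mean_a)
print("Mean under coding B:", mean_b)
```

The counts and the most common level stay the same however the levels are numbered, while the two means differ, which is why the codes can't be treated as a numeric variable.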
Ordinal variables can be ordered, or ranked in logical order, but the intervals between levels of the variable are not necessarily known. Subjective measurements are often ordinal variables. One example would be having people rank four items by preference in order from one to four. A different example would be having people assess several items based on a Likert rating scale: “On a scale of one to five, do you agree or disagree with this statement?” A third example is level of education for adults, considering for example “less than high school”, “high school”, “associate’s degree”, etc.
Critically, in each case we can order the responses: my favorite salad dressing is better than my second favorite, which is better than my third favorite. But we cannot know if the intervals between the levels are equal. For example, the distance between your favorite salad dressing and your second favorite salad dressing may be small, whereas there may be a large gap between your second and third choices.
We can logically assign numbers to levels of an ordinal variable, and can treat them in order, but shouldn’t treat them as numeric: “strongly agree” and “neutral” may not average out to an “agree.”
For the purposes of this book, we will consider such Likert data to be ordinal data under most circumstances.
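One way to summarize ordinal responses without averaging them is to report the median level. The sketch below, in Python with hypothetical responses and a five-level agreement scale assumed for illustration, uses the order of the levels but never treats the gaps between them as equal.

```python
import statistics

# Levels of the ordinal variable, listed from lowest to highest
levels = ["strongly disagree", "disagree", "neutral", "agree", "strongly agree"]

# Position of each level in the ordering (1 = lowest)
position = {level: i for i, level in enumerate(levels, start=1)}

# Hypothetical responses to one Likert item
responses = ["agree", "neutral", "strongly agree", "agree", "disagree"]

# median_low always returns a value that actually occurs in the data,
# so it never averages two different levels together
positions = [position[r] for r in responses]
median_position = statistics.median_low(positions)
median_response = levels[median_position - 1]

print("Median response:", median_response)
```

The numbers here serve only to put the levels in order; the result is reported as a level of the variable, not as a number.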
Interval/ratio variables are measured or counted values: age, height, weight, number of students. The interval between numbers is known to be equal: the interval between one kilogram and two kilograms is the same as between three kilograms and four kilograms. Interval/ratio data are also called “quantitative” data, although ordinal data are also quantitative.
Discrete and continuous variables
A further division of interval/ratio data is between discrete and continuous variables. Discrete variables have values that are necessarily whole numbers or other discrete values, such as population or counts of items. Continuous variables can take on any value within an interval, and so can be expressed as decimals. They are often measured quantities. For example, in theory a weight could be measured as 1 kg, 1.01 kg, or 1.009 kg, and so on. Age could also be considered a continuous variable, though we often treat it as a discrete variable by rounding it down to the most recent birthday.
Optional technical note
There is a technical difference between interval and ratio data. For interval data, the interval between measurements is the same, but the ratio between measurements is not meaningful. A common case of this is temperature measured in degrees Fahrenheit. The difference between 5 °F and 10 °F is the same as that between 10 °F and 15 °F. But 10 °F is not twice as warm as 5 °F. This is because the definition of 0 °F is arbitrary: 0 °F does not equal 0 °C.
Measurements where there is a natural zero, such as length or height, or where a zero can be honestly defined, such as time since an event, are considered ratio data.
For the most part, ratio and interval data are considered together. In general, just be careful not to make senseless statements with interval data, such as saying, “The mean temperature in Greenhouse 1 was twice the mean temperature of Greenhouse 2.”
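The Fahrenheit example can be checked with a little arithmetic. The sketch below, in Python, converts the temperatures to kelvins, a scale with a natural zero; the kelvin scale is not part of the discussion above and is brought in here only as an illustration. Differences behave consistently across the two scales, but the apparent ratio of two does not survive the conversion.

```python
def fahrenheit_to_kelvin(deg_f):
    """Convert degrees Fahrenheit to kelvins (a scale with a natural zero)."""
    return (deg_f - 32.0) * 5.0 / 9.0 + 273.15


# Equal intervals in Fahrenheit remain equal intervals in kelvins
diff_low = fahrenheit_to_kelvin(10.0) - fahrenheit_to_kelvin(5.0)
diff_high = fahrenheit_to_kelvin(15.0) - fahrenheit_to_kelvin(10.0)
print("5 to 10 F in kelvins: ", round(diff_low, 4))
print("10 to 15 F in kelvins:", round(diff_high, 4))

# The ratio of the Fahrenheit numbers is 2, but the ratio of the
# temperatures on a scale with a true zero is only slightly above 1
ratio_f = 10.0 / 5.0
ratio_k = fahrenheit_to_kelvin(10.0) / fahrenheit_to_kelvin(5.0)
print("Ratio of Fahrenheit values:", ratio_f)
print("Ratio of kelvin values:    ", round(ratio_k, 4))
```

Because the ratio depends on where the scale places its zero, a statement such as “twice the temperature” is only meaningful for ratio data.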
In general it is advantageous to treat variables as the highest level of measurement for which they qualify. That is, we could treat education level as a categorical variable, but usually we will want to treat it as an ordinal variable. This is because treating it as an ordinal variable retains more of the information carried in the data. If we were to reduce it to a categorical variable, we would lose the order of the levels of the variable. By using a higher level of measurement, we will have more options in the way we analyze, summarize, and present data.
This being said, there may be cases when it is advantageous to treat ordinal or count data as categorical. One case is if there are few levels of the variable, or if it makes sense to condense the variable into a couple of broad categories. Another example of when we choose a lower level of measurement is when we use nonparametric statistical analyses which treat interval/ratio data as ordinal, or ranked, data.
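To illustrate what it means to treat interval/ratio data as ranked data, the sketch below replaces each value with its rank, giving tied values the average of the ranks they share. It is written in Python; the function name average_ranks is hypothetical, and averaging tied ranks is one common convention, assumed here rather than taken from the text. The values are Bob Belcher's eight ratings from the long-format table above.

```python
def average_ranks(values):
    """Hypothetical helper: rank values from 1 upward, averaging ranks for ties."""
    # Indices of the values, ordered from smallest value to largest
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)

    pos = 0
    while pos < len(order):
        # Extend 'end' to cover every value tied with the one at 'pos'
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        # Tied values share the average of the 1-based ranks pos+1 .. end+1
        shared_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared_rank
        pos = end + 1
    return ranks


ratings = [4, 5, 4, 6, 6, 6, 10, 6]
ranks = average_ranks(ratings)
print(ranks)
```

After ranking, the rating of 10 is simply the highest rank; how far it sits above the other ratings is no longer part of the data, which is the information given up by moving to a lower level of measurement.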
[Video] “Types of Data: Nominal, Ordinal, Interval/Ratio” from Statistics Learning Center (Dr. Nic). 2011. www.youtube.com/watch?v=hZxnzfnt5v8.
“Types of biological variables” in McDonald, J.H. 2014. Handbook of Biological Statistics. www.biostathandbook.com/variabletypes.html.
“Frequency, frequency tables, and levels of measurement”, Chapter 1.3 in Openstax College. 2013. Introductory Statistics. Rice University. openstaxcollege.org/textbooks/introductory-statistics.
“Data basics”, Chapter 1.2 in Diez, D.M., C.D. Barr, and M. Çetinkaya-Rundel. 2012. OpenIntro Statistics, 2nd ed. www.openintro.org/.
1. For the following variables and levels, identify the variable as nominal, ordinal, or interval/ratio.
a. Political affiliation (very liberal, liberal, independent, conservative, very conservative)
b. Political affiliation (Democrat, Republican, Green, Libertarian)
c. Student age
d. Favorite school subject (Science, History, Math, Physical Education, English Literature)
e. How often do you complete your homework (never, sometimes, often, always)
f. Gender identity (1-female, 2-male, 3-other)
g. Level of education (1-elementary school, 2-junior high school, 3-high school, 4-college)
h. Dairy cow weight
i. Number of meals and snacks eaten in a day
j. (Yes, No)
k. (Yes, No, Not applicable)
l. Favorite salad dressing (1st, 2nd, 3rd, 4th)
m. Favorite salad dressing (Italian, French, Caesar, etc.)
2. Is each of the following terms associated with nominal, ordinal, or interval/ratio data?
3. For the following table, identify:
a. Number of observations
b. Number of variables
c. Type of each variable (nominal, ordinal, or interval/ratio)
Student   Grade   Age   Sex     Height   Calories   Attitude   Class
A         5       10    M       137      2000       5          4
B         5       11    M       140      1500       4          2
C         4       9     F       120      1200       3          5
D         4       10    F       140      1400       4          1
E         6       12    Other   147      1800       5          6
F         5       10    M       135      1600       4          8
G         4       9     M       130      1200       4          3
H         4       10    F       140      1800       3          7