### Organizing data: observations and variables

In general, collected raw data is organized according
to *observations* and *variables*. Each variable represents
a single measurement or characteristic recorded for every observation. In
the Statistics Learning Center video in the *Required readings*
below, Dr. Nic gives an example of a survey in which each observation is
a separate person, and the variables are age, sex, and chocolate preference
for each person.

In education evaluation, an observation will commonly be a student or an instructor, but could also be any other experimental unit, such as a single farmer’s field.

In practice, a single observation may be more specific:
for example, the rating for one instructor by one student on one date.
The following table shows part of the example from the *Friedman Test*
chapter, with some additional hypothetical data.

Each row represents an observation. So each
observation contains the rating by a single student on a single date
for an instructor. The variables are *Instructor*, *Date*,
*Student*, and *Rating*.

| Instructor | Date | Student | Rating |
|---|---|---|---|
| Bob Belcher | 2015-01-01 | a | 4 |
| Bob Belcher | 2015-01-01 | b | 5 |
| Bob Belcher | 2015-01-01 | c | 4 |
| Bob Belcher | 2015-01-01 | d | 6 |
| Bob Belcher | 2015-02-05 | e | 6 |
| Bob Belcher | 2015-02-05 | f | 6 |
| Bob Belcher | 2015-02-05 | g | 10 |
| Bob Belcher | 2015-02-05 | h | 6 |
| Linda Belcher | 2015-01-01 | a | 8 |
| Linda Belcher | 2015-01-01 | b | 6 |
| Linda Belcher | 2015-01-01 | c | 8 |
| Linda Belcher | 2015-01-01 | d | 8 |
| Linda Belcher | 2015-02-05 | e | 8 |
| Linda Belcher | 2015-02-05 | f | 7 |
| Linda Belcher | 2015-02-05 | g | 10 |
| Linda Belcher | 2015-02-05 | h | 9 |
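As a minimal sketch of this layout, each observation can be stored as one record and each variable as a named field. The example uses Python rather than *R* purely for illustration, the name `ratings_long` is hypothetical, and only four rows of the table are included.

```python
# Each dictionary is one observation (one row of the table);
# each key is one variable (one column of the table).
ratings_long = [
    {"Instructor": "Bob Belcher", "Date": "2015-01-01", "Student": "a", "Rating": 4},
    {"Instructor": "Bob Belcher", "Date": "2015-01-01", "Student": "b", "Rating": 5},
    {"Instructor": "Linda Belcher", "Date": "2015-01-01", "Student": "a", "Rating": 8},
    {"Instructor": "Linda Belcher", "Date": "2015-01-01", "Student": "b", "Rating": 6},
]

n_observations = len(ratings_long)      # number of rows
variables = list(ratings_long[0].keys())  # column names

print(n_observations)  # 4
print(variables)       # ['Instructor', 'Date', 'Student', 'Rating']
```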

#### Long-format and wide-format data

The example above has ratings for each instructor
on each of two dates and by each of eight students. Because each
observation has just one rating value, this data is in *long format*.

In general, it is best to keep data in long format for summaries and analyses.

However, to conduct certain analyses in *R*,
or to produce certain plots, data will need to be in wide format.
The following is an example of the same data, translated into wide format,
with a focus on the ratings by date, and in this case ignoring student.

| Instructor | Rating, 2015-01-01 | Rating, 2015-02-05 |
|---|---|---|
| Bob Belcher | 4 | 6 |
| Bob Belcher | 5 | 6 |
| Bob Belcher | 4 | 10 |
| Bob Belcher | 6 | 6 |
| Linda Belcher | 8 | 8 |
| Linda Belcher | 6 | 7 |
| Linda Belcher | 8 | 10 |
| Linda Belcher | 8 | 9 |
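The long-to-wide translation can be sketched in a few lines. This is an illustrative Python version using only the standard library, not the *R* approach used elsewhere in this book. It uses a subset of the ratings, and the names `long_rows` and `wide_rows` are hypothetical. Ratings are paired across dates by their order within each instructor, which assumes the same number of ratings on each date.

```python
from collections import defaultdict

# Long format: one rating per row (Instructor, Date, Student, Rating).
long_rows = [
    ("Bob Belcher", "2015-01-01", "a", 4),
    ("Bob Belcher", "2015-01-01", "b", 5),
    ("Bob Belcher", "2015-02-05", "e", 6),
    ("Bob Belcher", "2015-02-05", "f", 6),
    ("Linda Belcher", "2015-01-01", "a", 8),
    ("Linda Belcher", "2015-01-01", "b", 6),
    ("Linda Belcher", "2015-02-05", "e", 8),
    ("Linda Belcher", "2015-02-05", "f", 7),
]

# The distinct dates become the rating columns of the wide table.
dates = sorted({date for _, date, _, _ in long_rows})

# Collect ratings per instructor and date, ignoring the student.
by_instructor = defaultdict(lambda: defaultdict(list))
for instructor, date, _student, rating in long_rows:
    by_instructor[instructor][date].append(rating)

# Wide format: one row per instructor and position, one column per date.
wide_rows = []
for instructor, ratings_by_date in by_instructor.items():
    columns = [ratings_by_date[date] for date in dates]
    for ratings in zip(*columns):
        wide_rows.append((instructor, *ratings))

for row in wide_rows:
    print(row)
# ('Bob Belcher', 4, 6)
# ('Bob Belcher', 5, 6)
# ('Linda Belcher', 8, 8)
# ('Linda Belcher', 6, 7)
```

Note that the wide rows no longer record which student gave each rating, which is one reason long format is usually the safer format to store.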

### Types of variables

The most common variables used in data analysis can be classified as one of three types of variables: nominal, ordinal, and interval/ratio.

Understanding the differences in these types of variables is critical, since the variable type will determine which statistical analysis will be valid for that data. In addition, the way we summarize data with statistics and plots will be determined by the variable type.

#### Nominal data

Nominal variables are variables whose levels are labels
or descriptions, and which cannot be ordered. Examples of nominal
variables are *sex*, *school*, and yes/no questions.
They are also called “categorical” or “qualitative” variables, and the
levels of a variable are sometimes called “classes” or “groups”.

The levels of categorical variables cannot be ordered.
For the variable *sex*, it makes no sense to try to put the levels
“female”, “male”, and “other” in any numerical order. If levels
are numbered for convenience, the numbers are arbitrary, and the variable
can’t be treated as a numeric variable.
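To see why numeric codes for nominal levels are arbitrary, consider the following sketch. It is written in Python for illustration, and the responses and both codings are hypothetical. Counts and the most common level stay the same however the levels are numbered, but a mean of the codes changes with the numbering and so carries no meaning.

```python
from collections import Counter
from statistics import mean

# Hypothetical responses for the nominal variable "school".
schools = ["North", "South", "North", "East", "North", "South"]

# Counts and the most common level are meaningful for nominal data.
counts = Counter(schools)
most_common_school, _ = counts.most_common(1)[0]

# Two equally arbitrary numeric codings of the same levels.
coding_a = {"North": 1, "South": 2, "East": 3}
coding_b = {"North": 3, "South": 1, "East": 2}

mean_a = mean([coding_a[s] for s in schools])
mean_b = mean([coding_b[s] for s in schools])

print(counts["North"], counts["South"], counts["East"])  # 3 2 1
print(most_common_school)                                # North
print(mean_a == mean_b)                                  # False
```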

#### Ordinal data

Ordinal variables can be ordered, or ranked in logical order, but the intervals between levels of the variable are not necessarily known. Subjective measurements are often ordinal variables. One example would be having people rank four items by preference in order from one to four. A different example would be having people assess several items on a Likert-type rating scale: “On a scale of one to five, do you agree or disagree with this statement?” A third example is level of education for adults, with levels such as “less than high school”, “high school”, “associate’s degree”, etc.

Critically, in each case we can order the responses: my favorite salad dressing is better than my second favorite, which is better than my third favorite. But we cannot know if the intervals between the levels are equal. For example, the distance between your favorite salad dressing and your second favorite may be small, whereas there may be a large gap between your second and third choices.

We can logically assign numbers to levels of an ordinal variable, and can treat them in order, but shouldn’t treat them as numeric: “strongly agree” and “neutral” may not average out to an “agree.”

For the purposes of this book, we will consider such Likert data to be ordinal data under most circumstances.
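A summary that uses only the order of the levels, such as the median, is appropriate for ordinal data. The following is a minimal Python sketch with hypothetical responses. It uses `median_low` from the standard library so that the result is always an observed level rather than a value interpolated between two levels.

```python
from statistics import median_low

# Ordered levels of a five-point Likert item, lowest to highest.
levels = ["strongly disagree", "disagree", "neutral", "agree", "strongly agree"]
rank = {level: i for i, level in enumerate(levels, start=1)}

# Hypothetical responses from five people.
responses = ["agree", "neutral", "strongly agree", "agree", "disagree"]

# Sorting by rank uses only the order of the levels, not their spacing.
ordered = sorted(responses, key=rank.get)

# The median rank is mapped back to its level label.
median_rank = median_low([rank[r] for r in responses])
median_response = levels[median_rank - 1]

print(ordered)
# ['disagree', 'neutral', 'agree', 'agree', 'strongly agree']
print(median_response)  # agree
```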

#### Interval/ratio data

Interval/ratio variables are measured or counted
values: *age*, *height*, *weight*, *number of students*.
The interval between numbers is known to be equal: the interval
between one kilogram and two kilograms is the same as between three
kilograms and four kilograms. Interval/ratio data are also called
“quantitative” data, although ordinal data are also quantitative.

##### Discrete and continuous variables

Interval/ratio data can be further divided into *discrete* and
*continuous* variables. *Discrete* variables have values that are
necessarily whole numbers or other discrete values, such as population
or counts of items.
*Continuous* variables can take on any value within an interval,
and so can be expressed as decimals. They are often measured quantities.
For example, in theory a weight could be measured as 1 kg, 1.01 kg,
or 1.009 kg, and so on. Age could also be considered a continuous
variable, though we often treat it as a discrete variable by rounding
it down to the most recent birthday.
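The two views of age can be illustrated with a short sketch. It is written in Python for illustration, and the birth date, reference date, and function names are hypothetical. The continuous version is only an approximation, since it assumes an average year length of 365.25 days.

```python
from datetime import date

def age_continuous(birth, today):
    """Approximate age in years as a decimal (assumes 365.25 days per year)."""
    return (today - birth).days / 365.25

def age_completed_years(birth, today):
    """Age as a whole number: years completed at the most recent birthday."""
    years = today.year - birth.year
    # Subtract one year if this year's birthday has not yet occurred.
    if (today.month, today.day) < (birth.month, birth.day):
        years -= 1
    return years

birth = date(2005, 6, 15)
today = date(2015, 2, 5)

print(age_completed_years(birth, today))        # 9
print(round(age_continuous(birth, today), 2))   # a decimal value between 9 and 10
```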

##### Optional technical note

There is a technical difference between interval
and ratio data. For interval data, the interval between measurements
is the same, but the ratio between measurements is not meaningful. A common
case of this is temperature measured in degrees Fahrenheit. The
difference between 5 °F and 10 °F is the same as that between 10 °F
and 15 °F. But 10 °F is **not** twice as warm as 5 °F.
This is because the definition of 0 °F is arbitrary: 0 °F does
not equal 0 °C, so the zero point depends on the scale chosen.

Measurements where there is a natural zero, such as length or height, or where a zero can be honestly defined, such as time since an event, are considered ratio data.

For the most part, ratio and interval data are considered together. In general, just be careful not to make senseless statements with interval data, such as saying, “The mean temperature in Greenhouse 1 was twice the mean temperature of Greenhouse 2.”
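The Fahrenheit example can be checked numerically. The sketch below, in Python for illustration, converts the temperatures to kelvins, a scale that is not discussed above but that has a natural zero. Equal intervals remain equal after conversion, while the apparent ratio of 2 between 10 °F and 5 °F disappears. The function name is hypothetical.

```python
import math

def fahrenheit_to_kelvin(temp_f):
    """Convert degrees Fahrenheit to kelvins (a scale with a natural zero)."""
    return (temp_f - 32.0) * 5.0 / 9.0 + 273.15

t1_f, t2_f, t3_f = 5.0, 10.0, 15.0
t1_k, t2_k, t3_k = (fahrenheit_to_kelvin(t) for t in (t1_f, t2_f, t3_f))

# Intervals: equal differences in Fahrenheit stay equal in kelvins.
diff_12 = t2_k - t1_k
diff_23 = t3_k - t2_k
print(math.isclose(diff_12, diff_23))  # True

# Ratios: the ratio of the Fahrenheit numbers is not preserved.
ratio_f = t2_f / t1_f
ratio_k = t2_k / t1_k
print(ratio_f)            # 2.0
print(round(ratio_k, 3))  # 1.011
```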

#### Levels of measurement

In general it is advantageous to treat variables
as the highest level of measurement for which they qualify. That
is, we **could** treat education level as a categorical variable,
but usually we will want to treat it as an ordinal variable. This
is because treating it as an ordinal variable retains more of the information
carried in the data. If we were to reduce it to a categorical
variable, we would lose the order of the levels of the variable.
By using a higher level of measurement, we will have more options in
the way we analyze, summarize, and present data.

This being said, there may be cases when it is advantageous to treat ordinal or count data as categorical. One case is if there are few levels of the variable, or if it makes sense to condense the variable into a couple of broad categories. Another example of when we choose a lower level of measurement is when we use nonparametric statistical analyses which treat interval/ratio data as ordinal, or ranked, data.
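As a sketch of what treating interval/ratio data as ranked data means, the following Python function replaces each value with its rank, giving tied values the average of the ranks they span. The function name and the height values are hypothetical, and this illustrates the general idea rather than the internals of any particular statistical test.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    # Indices of the values, sorted from smallest to largest value.
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        # Extend `end` to cover the whole run of values tied with `start`.
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        # Average of the 1-based positions start+1 .. end+1.
        shared_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared_rank
        start = end + 1
    return ranks

heights_cm = [137, 140, 120, 140, 147]
print(average_ranks(heights_cm))  # [2.0, 3.5, 1.0, 3.5, 5.0]
```

After ranking, the size of the gap between 120 and 137 is no longer used, only the fact that 120 is smaller. This is the information given up by moving to a lower level of measurement.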

### Required readings

**[Video] “Types of Data: Nominal, Ordinal,
Interval/Ratio”** from Statistics Learning Center (Dr. Nic). 2011.
www.youtube.com/watch?v=hZxnzfnt5v8.

### Optional readings

**“Types of biological variables”** in McDonald, J.H. 2014. *Handbook of Biological Statistics*. www.biostathandbook.com/variabletypes.html.

**“Frequency, frequency tables, and levels of measurement”, Chapter 1.3** in Openstax College. 2013. *Introductory Statistics*. Rice University. openstaxcollege.org/textbooks/introductory-statistics.

**“Data basics”, Chapter 1.2** in Diez, D.M., C.D. Barr, and M. Çetinkaya-Rundel. 2012. *OpenIntro Statistics*, 2nd ed. www.openintro.org/.

### Exercises B

1. For the following variables and levels, identify each variable as nominal, ordinal, or interval/ratio.

a. Political affiliation (very liberal, liberal, independent, conservative, very conservative)

b. Political affiliation (Democrat, Republican, Green, Libertarian)

c. Student age

d. Favorite school subject (Science, History, Math, Physical Education, English Literature)

e. How often do you complete your homework (never, sometimes, often, always)

f. Gender identity (1-female, 2-male, 3-other)

g. Level of education (1-elementary school, 2-junior high school, 3-high school, 4-college)

h. Dairy cow weight

i. Number of meals and snacks eaten in a day

j. (Yes, No)

k. (Yes, No, Not applicable)

l. Favorite salad dressing (1st, 2nd, 3rd, 4th)

m. Favorite salad dressing (Italian, French, Caesar, etc.)

2. Is **each** of the following terms associated with
nominal, ordinal, or interval/ratio data?

a. Likert

b. Categorical

c. Continuous

d. Qualitative

e. Discrete

3. For the following table, identify:

a. Number of observations

b. Number of variables

c. Type of each variable (nominal, ordinal, or interval/ratio)

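| Student | Grade | Age | Sex | Height | Calories eaten | Attitude | Class rank |
|---|---|---|---|---|---|---|---|
| A | 5 | 10 | M | 137 | 2000 | 5 | 4 |
| B | 5 | 11 | M | 140 | 1500 | 4 | 2 |
| C | 4 | 9 | F | 120 | 1200 | 3 | 5 |
| D | 4 | 10 | F | 140 | 1400 | 4 | 1 |
| E | 6 | 12 | Other | 147 | 1800 | 5 | 6 |
| F | 5 | 10 | M | 135 | 1600 | 4 | 8 |
| G | 4 | 9 | M | 130 | 1200 | 4 | 3 |
| H | 4 | 10 | F | 140 | 1800 | 3 | 7 |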