### Organizing data: observations and variables

In general, collected raw data is organized according to *observations* and *variables*. Each variable represents a single measurement or characteristic recorded for each observation. In the Statistics Learning Center video in the *Required readings* below, Dr. Nic gives an example of a survey where each observation is a separate person, and the variables are age, sex, and chocolate preference for each person.

In education evaluation, observations will commonly be a student or an instructor, but could also be any other experimental unit, such as a single farmer’s field.

In reality, a single observation may be more specific, such as the rating of one instructor by one student on one date. Imagine the following data, where each of two instructors was rated by each of four students on two different dates.

Each row represents an observation. So each observation
contains the rating by a single student on a single date for a single
instructor. The variables are *Instructor*, *Date*, *Student*,
and *Rating*.

| Instructor | Date | Student | Rating |
|---|---|---|---|
| Bob Belcher | 2015-01-01 | a | 4 |
| Bob Belcher | 2015-01-01 | b | 5 |
| Bob Belcher | 2015-01-01 | c | 4 |
| Bob Belcher | 2015-01-01 | d | 6 |
| Bob Belcher | 2015-02-05 | a | 6 |
| Bob Belcher | 2015-02-05 | b | 6 |
| Bob Belcher | 2015-02-05 | c | 10 |
| Bob Belcher | 2015-02-05 | d | 6 |
| Linda Belcher | 2015-01-01 | a | 8 |
| Linda Belcher | 2015-01-01 | b | 6 |
| Linda Belcher | 2015-01-01 | c | 8 |
| Linda Belcher | 2015-01-01 | d | 8 |
| Linda Belcher | 2015-02-05 | a | 8 |
| Linda Belcher | 2015-02-05 | b | 7 |
| Linda Belcher | 2015-02-05 | c | 10 |
| Linda Belcher | 2015-02-05 | d | 9 |

#### Long-format and wide-format data

The example above has ratings for each instructor on each of
two dates and by each of four students. Because each observation has just one
rating value, this data is in *long format*.

In general, it is best to keep data in long format for summary and analysis.

However, to conduct certain analyses in *R*, or to
produce certain plots, data will need to be in wide format. The following is
an example of the same data, translated into wide format.

| Instructor | Student | Rating 2015-01-01 | Rating 2015-02-05 |
|---|---|---|---|
| Bob Belcher | a | 4 | 6 |
| Bob Belcher | b | 5 | 6 |
| Bob Belcher | c | 4 | 10 |
| Bob Belcher | d | 6 | 6 |
| Linda Belcher | a | 8 | 8 |
| Linda Belcher | b | 6 | 7 |
| Linda Belcher | c | 8 | 10 |
| Linda Belcher | d | 8 | 9 |
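As a rough sketch of how this translation might be done in R, the following reads the long-format ratings and converts them to wide format with the base function `reshape`. The object names `Ratings` and `Wide` are chosen here for illustration only, and other tools (for example, functions in add-on packages) could be used instead.

```r
# Long-format data: one rating per row.
# Instructor names are quoted because they contain a space.
Ratings = read.table(header=TRUE, stringsAsFactors=FALSE, text="
 Instructor       Date        Student  Rating
 'Bob Belcher'    2015-01-01  a        4
 'Bob Belcher'    2015-01-01  b        5
 'Bob Belcher'    2015-01-01  c        4
 'Bob Belcher'    2015-01-01  d        6
 'Bob Belcher'    2015-02-05  a        6
 'Bob Belcher'    2015-02-05  b        6
 'Bob Belcher'    2015-02-05  c        10
 'Bob Belcher'    2015-02-05  d        6
 'Linda Belcher'  2015-01-01  a        8
 'Linda Belcher'  2015-01-01  b        6
 'Linda Belcher'  2015-01-01  c        8
 'Linda Belcher'  2015-01-01  d        8
 'Linda Belcher'  2015-02-05  a        8
 'Linda Belcher'  2015-02-05  b        7
 'Linda Belcher'  2015-02-05  c        10
 'Linda Belcher'  2015-02-05  d        9
")

# Wide format: one row per Instructor-Student pair,
# with one Rating column for each Date.
Wide = reshape(Ratings,
               idvar     = c("Instructor", "Student"),
               timevar   = "Date",
               direction = "wide")

Wide
```

The long data frame has 16 rows and the wide one has 8, since each instructor-student pair now carries both of its ratings in a single row.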

### Types of variables

The most common variables used in data analysis can be classified as one of three types of variables: nominal, ordinal, and interval/ratio.

Understanding the differences in these types of variables is critical, since the variable type will determine which statistical analysis will be valid for that data. In addition, the way we summarize data with statistics and plots will be determined by the variable type.

#### Nominal data

Nominal variables are variables whose levels are labels or descriptions, and which cannot be ordered. Examples of nominal variables are *gender*, *school*, and yes/no questions. They are also called “nominal categorical” or “qualitative” variables, and the levels of a variable are sometimes called “classes” or “groups”.

The levels of nominal variables cannot be ordered. For the variable *gender*, it makes no sense to try to put the levels “female”, “male”, and “other” in any numerical order. If levels are numbered for convenience, the numbers are arbitrary, and the variable can’t be treated as a numeric variable.

#### Ordinal data

Ordinal variables can be ordered, or ranked in logical order, but the intervals between levels of the variables are not necessarily known. Subjective measurements are often ordinal variables. One example would be having people rank four items by preference in order from one to four. A different example would be having people assess several items based on a Likert item: “On a scale of one to five, do you agree or disagree with this statement?” A third example is level of education for adults, considering for example “less than high school”, “high school”, “associate’s degree”, etc.

Critically, in each case we can order the responses: my favorite salad dressing is better than my second favorite, which is better than my third favorite. But we cannot know if the intervals between the levels are equal. For example, the distance between your favorite salad dressing and your second favorite may be small, whereas there may be a large gap between your second and third choices.

We can logically assign numbers to levels of an ordinal variable, and can treat them in order, but shouldn’t treat them as numeric: “strongly agree” and “neutral” may not average out to an “agree.”

For the purposes of this book, we will consider such Likert item data to be ordinal data under most circumstances.

Ordinal data is sometimes called “ordered categorical”.

#### Interval/ratio data

Interval/ratio variables are measured or counted values: *age*, *height*, *weight*, *number of students*. The interval between numbers is known to be equal: the interval between one kilogram and two kilograms is the same as between three kilograms and four kilograms.

Interval/ratio data are also called “quantitative” data.

##### Discrete and continuous variables

Interval/ratio data can be further divided into *discrete* and *continuous* variables. Discrete variables take values that are necessarily whole numbers or other discrete values, such as counts of items. *Continuous* variables can take on any value within an interval, and so can be expressed as decimals. They are often measured quantities. For example, in theory a weight could be measured as 1 kg, 1.01 kg, or 1.009 kg, and so on. Age could also be considered a continuous variable, though we often treat it as a discrete variable by rounding it down to the age at the most recent birthday.

##### Optional technical note

There is a technical difference between interval and ratio data. For interval data, the interval between measurements is the same, but the ratio between measurements is not meaningful. A common case of this is temperature measured in degrees Fahrenheit. The difference between 5 °F and 10 °F is the same as that between 10 °F and 15 °F. But 10 °F is **not** twice as warm as 5 °F. This is because the definition of 0 °F is arbitrary. 0 °F does not equal 0 °C.

Measurements where there is a natural zero, such as length or height, or where a zero can be honestly defined, such as time since an event, are considered ratio data.

For the most part, ratio and interval data are considered together. Just be careful not to make senseless statements with interval data, such as saying, “The mean temperature in Greenhouse 1 was twice the mean temperature of Greenhouse 2.”
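The temperature example can be checked with a little arithmetic. The sketch below converts the Fahrenheit values to kelvins, a scale with a true zero, using the standard conversion formula. The function name `f.to.kelvin` is made up for this illustration.

```r
# Convert degrees Fahrenheit to kelvins (a scale with a natural zero).
f.to.kelvin = function(f) {(f - 32) * 5 / 9 + 273.15}

Temp.F = c(5, 10, 15)
Temp.K = f.to.kelvin(Temp.F)

# Intervals are equal on both scales.
diff(Temp.F)
diff(Temp.K)

# The ratio of the raw Fahrenheit numbers is 2 ...
Temp.F[2] / Temp.F[1]

# ... but on the scale with a true zero the ratio is only about 1.01.
Temp.K[2] / Temp.K[1]
```

The intervals survive the change of scale, but the apparent "twice as much" does not, which is why ratios of interval data should not be interpreted.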

#### Levels of measurement

In general it is advantageous to treat variables as the highest level of measurement for which they qualify. That is, we **could** treat education level as a nominal variable, but usually we will want to treat it as an ordinal variable. This is because treating it as an ordinal variable retains more of the information carried in the data. If we were to reduce it to a nominal variable, we would lose the order of the levels of the variable. By using a higher level of measurement, we will have more options in the way we analyze, summarize, and present data.

This being said, there may be cases when it is advantageous to treat ordinal or count data as categorical. One case is if there are few levels of the variable, or if it makes sense to condense the variable into a couple of broad categories. Another example of when we choose a lower level of measurement is when we use nonparametric statistical analyses which treat interval/ratio data as ordinal, or ranked, data.
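As a small illustration of moving to a lower level of measurement in R, the sketch below takes a few illustrative lengths in meters, converts them to ranks with `rank`, and condenses them into two broad categories with `cut`. The cut point of 5.5 and the labels "Small" and "Large" are arbitrary choices made for this example.

```r
# Illustrative interval/ratio measurements (lengths in meters).
Length.m = c(6.4, 6.1, 5.8, 5.5, 5.2, 4.9, 4.6)

# Treat the values as ranked (ordinal) data, as rank-based
# nonparametric analyses do. The smallest value gets rank 1.
Length.rank = rank(Length.m)
Length.rank

# Condense the values into two broad categories.
# Intervals are (4, 5.5] and (5.5, 7]; the break at 5.5 is arbitrary.
Length.group = cut(Length.m,
                   breaks = c(4, 5.5, 7),
                   labels = c("Small", "Large"))
table(Length.group)
```

In both cases some information is discarded: the ranks no longer record how far apart the lengths are, and the groups no longer record the order within each category.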

### Types of variables in R

R does not use the terms *nominal*, *ordinal*, and
*interval/ratio* for types of variables.

In R, nominal variables can be coded as variables with *factor*
or *character* classes.

Colors = c("Red", "Green", "Blue")

class(Colors)

[1] "character"

Colors.f = factor(c("Red", "Green", "Blue"))

class(Colors.f)

[1] "factor"
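To see why the numbers attached to a nominal variable are arbitrary, it may help to look at how R stores a factor. The following sketch shows that the integer codes simply follow the order of the levels, which by default is alphabetical. The name `Colors.f2` is introduced here only for illustration.

```r
Colors = c("Red", "Green", "Blue")

# By default, levels are sorted alphabetically.
Colors.f = factor(Colors)
levels(Colors.f)
as.integer(Colors.f)     # codes follow the alphabetical levels: 3 2 1

# Specifying a different level order changes the codes,
# but not the data themselves.
Colors.f2 = factor(Colors, levels = c("Red", "Green", "Blue"))
as.integer(Colors.f2)    # now 1 2 3

# A plain factor carries no ordering of its levels.
is.ordered(Colors.f)
```

Because the codes change whenever the level order changes, they should not be used in arithmetic as if they were measurements.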

Interval/ratio data can be coded as variables with *numeric* or *integer* classes. An *L* appended to a value tells R to store it as an integer.

BugCount = c(1, 2, 3, 4, 5)

class(BugCount)

[1] "numeric"

BugCount.int = c(1L, 2L, 3L, 4L, 5L)

class(BugCount.int)

[1] "integer"
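For most purposes the distinction between the two classes matters little, since R treats integer variables as numeric in calculations. A brief sketch, with `BugMean` and `BugCount.whole` as names invented for this example:

```r
BugCount.int = c(1L, 2L, 3L, 4L, 5L)

# Integer vectors still count as numeric data.
is.numeric(BugCount.int)

# Summaries of integer data are returned with numeric class.
BugMean = mean(BugCount.int)
class(BugMean)

# A numeric vector of whole numbers can be converted to integer class.
BugCount.whole = as.integer(c(1, 2, 3, 4, 5))
class(BugCount.whole)
```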

We can code ordinal data as either numeric or factor variables, depending on how we will be summarizing, plotting, and analyzing it.
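As a sketch of the two codings, the example below stores some made-up Likert item responses first as an ordered factor and then as integers. The responses and the object names (`Response`, `Response.f`, `Response.n`) are hypothetical.

```r
# The five levels of the item, from lowest to highest.
Levels = c("strongly disagree", "disagree", "neutral",
           "agree", "strongly agree")

# Hypothetical responses from five people.
Response = c("agree", "neutral", "strongly agree", "agree", "disagree")

# Coding as an ordered factor keeps the labels and their order.
Response.f = factor(Response, levels = Levels, ordered = TRUE)
Response.f[1] > Response.f[2]    # "agree" ranks above "neutral"
table(Response.f)

# Coding as numbers allows summaries such as the median.
Response.n = as.integer(Response.f)
Response.n
median(Response.n)
```

The ordered factor is convenient for counts and plots with labeled levels, while the numeric coding is convenient for rank-based summaries and tests.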

Note that with the *stringsAsFactors=TRUE* option, *read.table*
will read text input as a factor variable. Also, *read.table* will
determine if a numeric variable should have an integer or numeric class.

Dragons = read.table(header=TRUE, stringsAsFactors=TRUE, text="

Tribe Length.m SizeRank

IceWings 6.4 1

MudWings 6.1 2

SeaWings 5.8 3

SkyWings 5.5 4

NightWings 5.2 5

RainWings 4.9 6

SandWings 4.6 7

")

str(Dragons)

'data.frame': 7 obs. of 3 variables:

$ Tribe : Factor w/ 7 levels "IceWings","MudWings",..: 1 2 6 7 3 4 5

$ Length.m: num 6.4 6.1 5.8 5.5 5.2 4.9 4.6

$ SizeRank: int 1 2 3 4 5 6 7

We can convert the variable *SizeRank* into an ordered
factor variable. This will help us with some types of summary and analysis,
and is helpful in determining the order that the levels of a factor variable
will be plotted.

Dragons$SizeRank.f = factor(Dragons$SizeRank,

ordered = TRUE,

levels = c("1", "2", "3", "4", "5", "6", "7"))

str(Dragons)

'data.frame': 7 obs. of 4 variables:

$ Tribe : Factor w/ 7 levels "IceWings","MudWings",..: 1 2 6 7 3 4 5

$ Length.m : num 6.4 6.1 5.8 5.5 5.2 4.9 4.6

$ SizeRank : int 1 2 3 4 5 6 7

$ SizeRank.f: Ord.factor w/ 7 levels "1"<"2"<"3"<"4"<..: 1 2 3 4 5 6 7

sapply(Dragons, class)

$Tribe

[1] "factor"

$Length.m

[1] "numeric"

$SizeRank

[1] "integer"

$SizeRank.f

[1] "ordered" "factor"
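The same idea can be applied to the order of the levels of *Tribe*, which by default is alphabetical. The following sketch rebuilds the data frame so that it runs on its own, then creates a copy of *Tribe* whose levels follow the size ranking. The variable name `Tribe.size` is introduced only for this example.

```r
Dragons = read.table(header=TRUE, stringsAsFactors=TRUE, text="
 Tribe       Length.m  SizeRank
 IceWings    6.4       1
 MudWings    6.1       2
 SeaWings    5.8       3
 SkyWings    5.5       4
 NightWings  5.2       5
 RainWings   4.9       6
 SandWings   4.6       7
")

# Default level order is alphabetical.
levels(Dragons$Tribe)

# Re-level Tribe so that its levels follow SizeRank.
# Plots and tables would then list the tribes from largest to smallest.
Dragons$Tribe.size = factor(Dragons$Tribe,
      levels = as.character(Dragons$Tribe[order(Dragons$SizeRank)]))
levels(Dragons$Tribe.size)

# An ordered factor also supports comparisons between levels.
Dragons$SizeRank.f = factor(Dragons$SizeRank, ordered = TRUE)
Dragons$Tribe[Dragons$SizeRank.f < "4"]
```

Note that `Tribe.size` is still an unordered factor. Only the order in which its levels are stored, and therefore displayed, has changed.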

### References

Rouxzee. 2014. Wings of Fire: How Big are the Dragons? DeviantArt.com. (Since deactivated.)

### Required readings

**[Video] “Types of Data: Nominal, Ordinal, Interval/Ratio”** from Statistics Learning Center (Dr. Nic). 2011. www.youtube.com/watch?v=hZxnzfnt5v8.

### Optional readings

**“Types of biological variables”** in McDonald, J.H. 2014. *Handbook of Biological Statistics*. www.biostathandbook.com/variabletypes.html.

*“Frequency, frequency tables, and levels of measurement”*, **Chapter 1.3** in Openstax College. 2013. *Introductory Statistics*. Rice University. openstax.org/textbooks/introductory-statistics.

*“Data basics”*, **Chapter 1.2** in Diez, D.M., C.D. Barr, and M. Çetinkaya-Rundel. 2012. *OpenIntro Statistics*, 2nd ed. www.openintro.org/.

### Exercises B

1. For the following variables and levels, identify each variable as nominal, ordinal, or interval/ratio.

a. Political affiliation (very liberal, liberal, independent, conservative, very conservative)

b. Political affiliation (Democrat, Republican, Green, Libertarian)

c. Student age

d. Six ranked school subjects, according to which is the respondent’s favorite (Science, History, Math, Physical Education, English Literature)

e. How often do you complete your homework (never, sometimes, often, always)

f. Gender identity (1-female, 2-male, 3-other)

g. Level of education (1-elementary school, 2-junior high school, 3-high school, 4-college)

h. Dairy cow weight

i. Number of meals and snacks eaten in a day

j. (Yes, No)

k. (Yes, No, Not applicable)

l. Favorite salad dressing (1st, 2nd, 3rd, 4th)

m. Ranked salad dressings, according to which is the respondent’s favorite (Italian, French, Caesar, etc.)

2. Is **each** of the following terms associated with nominal, ordinal, or interval/ratio data?

a. Likert item responses

b. Categorical (two correct answers)

c. Continuous

d. Qualitative

e. Count

3. For the following table, identify:

a. Number of observations

b. Number of variables

c. Type of each variable (nominal, ordinal, or interval/ratio)

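| Student | Grade | Age | Gender | Height | Calories eaten | Attitude | Class rank |
|---|---|---|---|---|---|---|---|
| A | 5 | 10 | M | 137 | 2000 | 5 | 4 |
| B | 5 | 11 | M | 140 | 1500 | 4 | 2 |
| C | 4 | 9 | F | 120 | 1200 | 3 | 5 |
| D | 4 | 10 | F | 140 | 1400 | 4 | 1 |
| E | 6 | 12 | Other | 147 | 1800 | 5 | 6 |
| F | 5 | 10 | M | 135 | 1600 | 4 | 8 |
| G | 4 | 9 | M | 130 | 1200 | 4 | 3 |
| H | 4 | 10 | F | 140 | 1800 | 3 | 7 |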

4. For the following table, identify:

a. Number of observations

b. Number of variables

c. Type of each variable (nominal, ordinal, or interval/ratio)

| Farm | Town | Acreage | Product | Sustainability | Sustainability code | Preserved |
|---|---|---|---|---|---|---|
| A | Skara.Brae | 21.1 | Turnip | Very.unsustainable | 1 | Yes |
| B | Arboria | 7.0 | Parsnip | Sustainable | 4 | Yes |
| C | Lucencia | 32.5 | Rutabaga | Unsustainable | 2 | No |
| D | Tenebrosia | 21.0 | Daikon | Very.sustainable | 5 | No |
| E | Tarmitia | 6.3 | Yucca | Sustainable | 4 | Unknown |
| F | Malefia | 18.1 | Kohlrabi | Neutral | 3 | Yes |