
Summary and Analysis of Extension Program Evaluation in R

Salvatore S. Mangiafico

Converting Numeric Data to Categories

 

There are occasions when it is useful to categorize Likert scores, Likert scales, or continuous data into groups or categories.  There are no universal rules for converting numeric data to categories; a few common methods are presented here.

 

Categorizing data by a range of values

 

One approach is to create categories according to logical cut-off values in the scores or measured values.  An example of this is the common grading system in the U.S. in which a 90% grade or better is an “A”, 80–89% is a “B”, etc.  It is common in this approach to make the categories with equal spread in values.  For example, there is a 10-point spread in a “B” grade and a 10-point spread in a “C” grade.  But this equality is not required.  Below in the “Categorize data by range of values” example, 5-point Likert data are converted into categories with 4 and 5 being “High”, 3 being “Medium”, and 1 and 2 being “Low”.

 

This approach relies on the chosen cut-off points being meaningful.  For example, for a grade of 70–79 to be considered “sufficient”, the evaluation instruments (e.g. the tests, quizzes, and assignments) need to be calibrated so that a grade of 75 really is “sufficient”, and not “excellent” or “good”.  Imagine a case where a 4 or 5 on a 5-point Likert item is considered “high”: if all respondents scored 4 or 5 on the item, these values may not really be “high”, but simply the typical response.  Likewise, in this case, the decision to group 4 and 5 as “high” needs to be compared with the alternative of grouping 2, 3, and 4 as “medium”, which may be closer to how people interpret a 5-point Likert scale.  Either grouping could be meaningful depending on the purpose of the categorization and the interpretation of each level in the Likert item.
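As a minimal sketch of this approach in base R, the cut function assigns each value to a labeled range.  The Likert vector below is invented for illustration.


### Hypothetical responses on a 5-point Likert item

Likert = c(2, 3, 4, 4, 5, 1, 3, 5)

### Ranges: (0,2] = "Low", (2,3] = "Medium", (3,5] = "High"

Category = cut(Likert,
               breaks = c(0, 2, 3, 5),
               labels = c("Low", "Medium", "High"))

table(Category)


The breaks argument defines right-closed intervals, so each break value falls into the category below it.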

 

Categorizing data by percentiles

 

A second approach is to use percentiles to categorize data.  Remember, the value for the 90th percentile is the score or measurement below which 90% of respondents scored. 

 

The advantage to this approach is that it does not rely on the scoring system being meaningful in its absolute values.  That is, students scoring above the 90th percentile are scoring higher than 90% of students.  It doesn’t matter if the score for the 90th percentile is 90 out of 100 or 50 out of 100.  If the groups use equally spaced breakpoints, for example 0th, 25th, 50th, 75th, and 100th percentiles, there should be approximately an equal number of respondents in each category.
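A quick illustration of this point, using an invented score vector: the quantile function returns the breakpoints, and equally spaced percentiles yield roughly equal group sizes.


### Hypothetical test scores

Scores = c(55, 62, 70, 71, 75, 80, 84, 88, 91, 97)

### Breakpoints at the 0th, 25th, 50th, 75th, and 100th percentiles

Breaks = quantile(Scores, probs = c(0, 0.25, 0.50, 0.75, 1))

### include.lowest = TRUE keeps the minimum score in the first group

Quartile = cut(Scores, breaks = Breaks, include.lowest = TRUE)

table(Quartile)   ### Counts per group: 3, 2, 2, 3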

 

Categorizing data with clustering

 

A third approach is to use a clustering algorithm to divide data into groups with similar measurements.  This is useful when there are multiple measurements for an individual.  For example, if students receive scores on reading, math, and analytical thinking, an algorithm can determine if, for example, there is a group of students who do well on all three measurements, and a group of students who do well in math and analysis but who do poorly on reading.  Most clustering algorithms assume that measurement data are continuous in nature.  Below, the partitioning around medoids (PAM) method is used with the Manhattan metric, which may be relatively more suitable for ordinal data than some other methods.
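To make the metric concrete: the Manhattan distance between two students’ score profiles is the sum of the absolute differences on each measure.  Base R’s dist function computes it directly; the two score vectors below are invented.


### Two hypothetical students scored on reading, math, and analysis

Scores = rbind(StudentA = c(5, 4, 4),
               StudentB = c(2, 5, 3))

### Manhattan distance: |5-2| + |4-5| + |4-3| = 5

dist(Scores, method = "manhattan")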

 

Packages used in this chapter

 

The packages used in this chapter include:

•  psych

•  cluster

•  fpc

•  ggplot2

 

The following commands will install these packages if they are not already installed:


if(!require(psych)){install.packages("psych")}
if(!require(cluster)){install.packages("cluster")}
if(!require(fpc)){install.packages("fpc")}
if(!require(ggplot2)){install.packages("ggplot2")}


Examples for converting numeric data to categories


Input =("
Instructor  Student  Likert
 Homer      a        3
 Homer      b        4
 Homer      c        4
 Homer      d        4
 Homer      e        4
 Homer      f        5
 Homer      g        5
 Homer      h        5
 Homer      i        3
 Homer      j        2
 Homer      k        3
 Homer      l        4
 Homer      m        5
 Homer      n        5
 Homer      o        5
 Homer      p        4
 Homer      q        4
 Homer      r        3
 Homer      s        2
 Homer      t        5
 Homer      u        3
")

Data = read.table(textConnection(Input),header=TRUE)


###  Check the data frame


library(psych)

headTail(Data)

str(Data)

summary(Data)


### Remove unnecessary objects

rm(Input)


Categorize data by range of values

The following example will categorize responses on a single 5-point Likert item.  The scheme will be as follows: a score of 1 or 2 will be called “Low”, a score of 3 “Medium”, and a score of 4 or 5 “High”.

 

Categorize data


Data$Category[Data$Likert == 1 | Data$Likert == 2] = "Low"
Data$Category[Data$Likert == 3                   ] = "Medium"
Data$Category[Data$Likert == 4 | Data$Likert == 5] = "High"

Data


Order factor levels to make output easier to read


Data$Category = factor(Data$Category,
                    levels=c("Low", "Medium", "High"))


Summarize counts of categories


XT = xtabs(~ Category + Instructor,
           data = Data)

XT


        Instructor
Category Homer
  Low        2
  Medium     5
  High      14


Report students in each category


Data$Student[Data$Category == "Low"]


[1] j s


Data$Student[Data$Category == "Medium"]


[1] a i k r u


Data$Student[Data$Category == "High"]


[1] b c d e f g h l m n o p q t


Alternate method with tapply


tapply(X     = Data$Student,
       INDEX = Data$Category,
       FUN   = print)


$Low
[1] j s

$Medium
[1] a i k r u

$High
 [1] b c d e f g h l m n o p q t


Summary table


Category   Range     Count  Students
Low        1 or 2     2     j, s
Medium     3          5     a, i, k, r, u
High       4 or 5    14     b, c, d, e, f, g, h, l, m, n, o, p, q, t


Categorize data by percentile

The following example will categorize responses on a single 5-point Likert item.  Respondents scoring below the 33rd percentile will be labeled “Lower third”; those between the 33rd and 67th percentile “Middle third”; and those above the 67th percentile “Upper third”.

 

Categorize data


Percentile_00  = min(Data$Likert)
Percentile_33  = quantile(Data$Likert, 0.33333)
Percentile_67  = quantile(Data$Likert, 0.66667)
Percentile_100 = max(Data$Likert)


RB = rbind(Percentile_00, Percentile_33, Percentile_67, Percentile_100)

dimnames(RB)[[2]] = "Value"

RB


                Value
Percentile_00  2.0000
Percentile_33  3.6666
Percentile_67  4.3334
Percentile_100 5.0000


Data$Group[Data$Likert >= Percentile_00 & Data$Likert <  Percentile_33]  = "Lower_third"
Data$Group[Data$Likert >= Percentile_33 & Data$Likert <  Percentile_67]  = "Middle_third"
Data$Group[Data$Likert >= Percentile_67 & Data$Likert <= Percentile_100] = "Upper_third"

Data


Order factor levels to make output easier to read


Data$Group = factor(Data$Group,
                    levels=c("Lower_third", "Middle_third", "Upper_third"))


Summarize counts of groups


XT = xtabs(~ Group + Instructor,
           data = Data)

XT


              Instructor
Group          Homer
  Lower_third      7
  Middle_third     7
  Upper_third      7


Report students in each group


 tapply(X     = Data$Student,
        INDEX = Data$Group,
        FUN   = print)


$Lower_third
[1] a i j k r s u

$Middle_third
[1] b c d e l p q

$Upper_third
[1] f g h m n o t


Summary table


Group         Range       Count  Students
Lower third   1, 2, or 3  7      a, i, j, k, r, s, u
Middle third  4           7      b, c, d, e, l, p, q
Upper third   5           7      f, g, h, m, n, o, t
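The percentile grouping above can also be written in one step with cut and quantile.  The sketch below repeats the example Likert responses so that it is self-contained.  Note that cut closes intervals on the right while the comparisons above close on the left; with these data no score falls exactly on a breakpoint, so the groups agree.


### The example Likert responses, repeated so the sketch is self-contained

Likert = c(3, 4, 4, 4, 4, 5, 5, 5, 3, 2, 3,
           4, 5, 5, 5, 4, 4, 3, 2, 5, 3)

Group = cut(Likert,
            breaks = quantile(Likert, probs = c(0, 1/3, 2/3, 1)),
            labels = c("Lower_third", "Middle_third", "Upper_third"),
            include.lowest = TRUE)

table(Group)   ### 7, 7, 7, matching the groups above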


Categorize data by clustering

In the following example, each student has a score for Happy and for Tired.  The students will be divided into clusters based on the similarities of their scores across both measures.

 

Example data


Input =("
Instructor  Student  Happy   Tired
Marge       a        5       5
Marge       b        5       5
Marge       c        2       5
Marge       d        2       5
Marge       e        5       5
Marge       f        5       5
Marge       g        1       5
Marge       h        1       5
Marge       i        1       5
Marge       j        5       5
Marge       k        3       3
Marge       l        3       3
Marge       m        3       3
Marge       n        5       2
Marge       o        5       2
Marge       p        5       1
Marge       q        5       1
Marge       r        5       1
Marge       s        4       1
Marge       t        4       1
")

Data = read.table(textConnection(Input),header=TRUE)


###  Check the data frame


library(psych)

headTail(Data)

str(Data)

summary(Data)


### Remove unnecessary objects

rm(Input)


Plot data

In the following plot, each letter represents a Student from the data frame.  To me, this plot suggests that the data could be reasonably clustered into 4 or perhaps 7 or 8 clusters.


plot(jitter(Happy) ~ jitter(Tired),
     data = Data,
     pch=as.character(Data$Student))





Use only numeric data


Data.num = Data[c("Happy", "Tired")]


Determine the optimal number of clusters

The pamk function in the fpc package can determine the optimum number of clusters for the partitioning around medoids (PAM) method.  It will also complete the PAM analysis, but we’ll do that separately below. 

 

Practical considerations may override the results of the pamk function.  In the example below, if we include 1 in the possible range of cluster numbers, the function will determine that 1 is the optimum number.  Also, if we extend the range to, say, 10, the function will choose 7 as the optimum number, but this may be too many for our purposes.


library(fpc)

PAMK = pamk(Data.num,
            krange = 2:5,
            metric="manhattan")

PAMK$nc


[1] 4

   ###  This is the optimum number of clusters in the range


plot(PAMK$crit)
lines(PAMK$crit)




For the range of cluster numbers shown on the x-axis, the crit value (the average silhouette width) is maximized at 4, suggesting that 4 is the optimum number of clusters for this range.


Categorize data

We will use the pam function in the cluster package to divide our data into four clusters.


library(cluster)

PAM = pam(x = Data.num,
          k = 4,              ### Number of clusters to find
          metric="manhattan")

PAM


Medoids:
     ID Happy Tired
[1,] 10     5     5
[2,]  9     1     5
[3,] 13     3     3
[4,] 16     5     1

Clustering vector:
 [1] 1 1 2 2 1 1 2 2 2 1 3 3 3 4 4 4 4 4 4 4


### Add clusters to data frame

PAMClust = rep(NA, nrow(Data))

PAMClust[PAM$clustering == 1] = "Cluster 1"
PAMClust[PAM$clustering == 2] = "Cluster 2"
PAMClust[PAM$clustering == 3] = "Cluster 3"
PAMClust[PAM$clustering == 4] = "Cluster 4"


Data$Cluster = PAMClust

Data
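A more compact way to build the cluster labels is to paste them directly from the clustering vector.  The sketch below is self-contained, using an invented two-variable data frame with four well-separated pairs of students; it requires the cluster package.


library(cluster)

### Hypothetical scores: four well-separated pairs of students

Data.num = data.frame(Happy = c(5, 5, 1, 1, 3, 3, 5, 5),
                      Tired = c(5, 5, 5, 5, 3, 3, 1, 1))

PAM = pam(x = Data.num, k = 4, metric = "manhattan")

### paste builds the "Cluster 1" ... "Cluster 4" labels in one step

Cluster = factor(paste("Cluster", PAM$clustering))

table(Cluster)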


Order factor levels to make output easier to read


Data$Cluster = factor(Data$Cluster,
                      levels=c("Cluster 1", "Cluster 2",
                               "Cluster 3", "Cluster 4"))


Summarize counts of groups


XT = xtabs(~ Cluster + Instructor,
           data = Data)

XT


           Instructor
Cluster     Marge
  Cluster 1     5
  Cluster 2     5
  Cluster 3     3
  Cluster 4     7


Report students in each group


 tapply(X     = Data$Student,
        INDEX = Data$Cluster,
        FUN   = print)


$`Cluster 1`
[1] a b e f j

$`Cluster 2`
[1] c d g h i

$`Cluster 3`
[1] k l m

$`Cluster 4`
[1] n o p q r s t


Summary table


Cluster    Interpretation       Count  Students
Cluster 1  Tired and happy      5      a, b, e, f, j
Cluster 2  Tired and not happy  5      c, d, g, h, i
Cluster 3  Middle of the road   3      k, l, m
Cluster 4  Not tired and happy  7      n, o, p, q, r, s, t


Final plot


library(ggplot2)

ggplot(Data,
       aes(x     = Tired,
           y     = Happy,
           color = Cluster)) +
   geom_jitter(size = 3, width = 0.4, height = 0.4) +
   theme_bw()