leian

Reputation: 473

Is the factor() function in R only for categorical variables with a hierarchy (ordered levels)?

I'm quite confused about when to use

factor(educ) or factor(agegroup)
in R. Is it used only for ordered categorical data, or can I also use it for simple categorical data with no hierarchy?

I know this is so basic. I really need some clarification.

Upvotes: 1

Views: 24137

Answers (2)

Josh O'Brien

Reputation: 162341

You can flag a factor as ordered by creating it with ordered(x) or with factor(x, ordered=TRUE). The "Details" section of ?factor explains that:

Ordered factors differ from factors only in their class, but methods and the model-fitting functions treat the two classes quite differently.

You can confirm the first part of that quote (that they differ only in their class) by comparing the attributes of these two objects:

f  <- factor(letters[3:1], levels=letters[3:1])
of <- ordered(letters[3:1], levels=letters[3:1])
attributes(f)
# $levels
# [1] "c" "b" "a"
# 
# $class
# [1] "factor"
attributes(of)
# $levels
# [1] "c" "b" "a"
# 
# $class
# [1] "ordered" "factor" 

Various factor-handling R functions (the "methods and model-fitting functions" of the second part of that quote) will then use is.ordered() to test for the presence of that "ordered" class indicator, taking it as a directive to treat an ordered factor differently than an unordered one. Here are a couple of examples:

## The print method for factors. (Type 'print.factor' to see the function's code)
print(f)
# [1] c b a
# Levels: c b a
print(of)
# [1] c b a
# Levels: c < b < a

## The contrasts function. (Type 'contrasts' to see the function's code.)
contrasts(of)
#                 .L         .Q
# [1,] -7.071068e-01  0.4082483
# [2,]  4.350720e-18 -0.8164966
# [3,]  7.071068e-01  0.4082483
contrasts(f)
#   b a
# c 0 0
# b 1 0
# a 0 1
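Comparison operators and summaries such as max() are another place where the class matters. The sketch below re-creates the same f and of so that it runs on its own; the names of_cmp, f_cmp and of_max are only illustrative. It shows that < follows the level order for the ordered factor, while for the unordered factor R warns that the comparison is not meaningful and returns NA.

```r
# Re-create the two objects from above so this block is self-contained.
f  <- factor(letters[3:1], levels = letters[3:1])
of <- ordered(letters[3:1], levels = letters[3:1])

# Ordered factor: comparisons follow the level order c < b < a.
of_cmp <- of[1] < of[3]                  # is "c" below "a" in this ordering?

# Unordered factor: '<' is not meaningful, so R warns and returns NA.
f_cmp <- suppressWarnings(f[1] < f[3])

# max() is defined for ordered factors (it is an error for unordered ones).
of_max <- as.character(max(of))

print(of_cmp)
# [1] TRUE
print(f_cmp)
# [1] NA
print(of_max)
# [1] "a"
```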

Upvotes: 2

A5C1D2H2I1M1N2O1R2T1

Reputation: 193537

I don't really see a clear question here, so perhaps a simple example would suffice as an answer.

Imagine we have the following data.

set1 <- c("AA", "B", "BA", "CC", "CA", "AA", "BA", "CC", "CC")

We want to factor this data.

f.set1 <- factor(set1)

Let's look at the output. Note that R has simply sorted the levels alphabetically; nothing in the output marks them as hierarchical (see the "Levels" line).

f.set1
# [1] AA B  BA CC CA AA BA CC CC
# Levels: AA B BA CA CC
is.ordered(f.set1)
# [1] FALSE

However, using as.numeric on the factored data might fool you into thinking it is hierarchical. Note that "5" comes before "4" in the output below, and note also the alphabetized output of table(f.set1) (which also happens if you simply call table(set1)).

as.numeric(f.set1)
# [1] 1 2 3 5 4 1 3 5 5
table(f.set1)
# f.set1
# AA  B BA CA CC 
#  2  1  2  1  3 
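One way to convince yourself that those integers are only positions in levels(f.set1), and not ranks, is to map them back to the labels. This is a small sketch using the same set1; the names codes, labs and f.rev are just for illustration.

```r
set1   <- c("AA", "B", "BA", "CC", "CA", "AA", "BA", "CC", "CC")
f.set1 <- factor(set1)

codes <- as.integer(f.set1)       # the internal integer codes
labs  <- levels(f.set1)[codes]    # look each code up in the levels vector

# The round trip recovers the original strings exactly.
identical(labs, set1)
# [1] TRUE

# Storing the levels in a different order changes the codes, but the data
# are the same and the factor is still unordered.
f.rev <- factor(set1, levels = rev(levels(f.set1)))
identical(as.character(f.rev), set1)
# [1] TRUE
is.ordered(f.rev)
# [1] FALSE
```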

Let's now compare this with what happens when we use the ordered argument along with the levels argument. Using levels plus ordered = TRUE tells R that this categorical data is hierarchical, in the order specified by levels (not alphabetically and not in the order in which we entered the data).

o.set1 <- factor(set1, 
                 levels = c("CA", "BA", "AA", "CC", "B"), 
                 ordered = TRUE)

Even viewing the output shows us hierarchy now.

o.set1
# [1] AA B  BA CC CA AA BA CC CC
# Levels: CA < BA < AA < CC < B
is.ordered(o.set1)
# [1] TRUE

As do the functions as.numeric and table.

as.numeric(o.set1)
# [1] 3 5 2 4 1 3 2 4 4
table(o.set1)
# o.set1
# CA BA AA CC  B 
#  1  2  2  3  1

So, to summarize, factor() by itself just creates a non-hierarchical factor of your categorical data, with its levels sorted; factor() with the levels and ordered = TRUE arguments creates hierarchical categories.

Alternatively, use ordered() if you directly want to create ordered factors. The order of the categories still needs to be specified:

ordered(set1, levels = c("CA", "BA", "AA", "CC", "B"))
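As a quick check (a sketch with the same set1 and level order; the names lv, o1, o2 and so on are only illustrative), the two spellings give the same object, and the declared order then drives sort() and comparisons against a level label:

```r
set1 <- c("AA", "B", "BA", "CC", "CA", "AA", "BA", "CC", "CC")
lv   <- c("CA", "BA", "AA", "CC", "B")

o1 <- ordered(set1, levels = lv)
o2 <- factor(set1, levels = lv, ordered = TRUE)

# Both calls produce the same ordered factor.
same <- identical(o1, o2)
print(same)
# [1] TRUE

# sort() follows the declared level order, not the alphabet.
sorted <- as.character(sort(o1))
print(sorted)
# [1] "CA" "BA" "BA" "AA" "AA" "CC" "CC" "CC" "B"

# Comparisons against a level label work too: keep everything ranked above "AA".
above_AA <- set1[o1 > "AA"]
print(above_AA)
# [1] "B"  "CC" "CC" "CC"
```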

Upvotes: 6
