Dambo

Reputation: 3496

How to prevent integer reordering when converting to factor?

This is really more a question on the why and how of this behavior in R.

I have a vector

c("18", "68", "18-20", "22", "27", "16-18", "unkown")

I would expect that if I ran

as.factor(c("18", "68", "18-20", "22", "27", "16-18", "unkown"))

the levels would follow the order of the elements in the vector. Instead, they come out ordered as if R had tried to interpret the numeric characters in each element:

[1] 18     68     18-20  22     27     16-18  unkown
Levels: 16-18 18 18-20 22 27 68 unkown

I can see how this could happen if the elements were of class character but effectively integer/numeric. With more ambiguous formats such as 18-20, though, I am not sure how R knows how to order them. In fact, if I convert to factor in two steps (first to integer, then to factor), the first step already fails for those elements:

> as.integer(c("18", "68", "18-20", "22", "27", "16-18", "unkown"))
[1] 18 68 NA 22 27 NA NA
Warning message:
NAs introduced by coercion 

That makes perfect sense, because 18-20 is just a character string, not a number.

Upvotes: 1

Views: 240

Answers (1)

Florian

Reputation: 25435

The documentation for factor() describes what happens when no set of levels is supplied:

levels: an optional vector of the values (as character strings) that x might have taken. The default is the unique set of values taken by as.character(x), sorted into increasing order of x. Note that this set can be specified as smaller than sort(unique(x)).

So the ordering has nothing to do with the numeric values: the levels are simply sorted as character strings. And indeed:

> sort(unique(as.character(c("18", "68", "18-20", "22", "27", "16-18", "unkown"))))
[1] "16-18"  "18"     "18-20"  "22"     "27"     "68"     "unkown"
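To make the difference between a character sort and a numeric sort more visible, here is a minimal sketch. The vector `ages` is a made-up example and not part of the question; the exact collation of strings can depend on the locale, but these three values should sort the same way in any common one.

```r
# Hypothetical example vector: numbers stored as character strings
ages <- c("3", "20", "100")

# Character sort compares the strings character by character,
# so "100" comes before "20", which comes before "3"
char_sorted <- sort(ages)

# A numeric sort requires an explicit conversion first
num_sorted <- sort(as.numeric(ages))

# The default factor levels follow the character sort, not the numeric one
default_levels <- levels(factor(ages))

print(char_sorted)
print(num_sorted)
print(default_levels)
```

If the levels had been ordered by numeric value, "3" would come first; with the default behavior it comes last.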

You can prevent the ordering as follows:

> x <- c("18", "68", "18-20", "22", "27", "16-18", "unkown")
> factor(x, levels = unique(x))

[1] 18     68     18-20  22     27     16-18  unkown
Levels: 18 68 18-20 22 27 16-18 unkown
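If you need this in several places, the same idea can be wrapped in a small helper. This is only a sketch: `factor_in_order` is a hypothetical name, and the extra trailing "18" is added here purely to show that repeated values do not create duplicate levels.

```r
# Hypothetical helper: build a factor whose levels keep first-appearance order
factor_in_order <- function(x) {
  factor(x, levels = unique(x))
}

# Vector from the question, plus one repeated "18" (an assumption for illustration)
x <- c("18", "68", "18-20", "22", "27", "16-18", "unkown", "18")
f <- factor_in_order(x)

# Levels appear in order of first occurrence, without duplicates
print(levels(f))

# The integer codes index into those levels, so the repeated "18" maps back to 1
print(as.integer(f))
```

Because `unique()` keeps the first occurrence of each value, the level order mirrors the order in which values first appear in the data.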

Upvotes: 1
