Reputation: 3496
This is really more a question on the why and how of this behavior in R.
I have a vector
c("18", "68", "18-20", "22", "27", "16-18", "unkown")
I would expect that if I ran
as.factor(c("18", "68", "18-20", "22", "27", "16-18", "unkown"))
the levels would follow the order of the elements in the vector. Instead, they are ordered as if R tried to interpret the numeric characters in each element:
[1] 18 68 18-20 22 27 16-18 unkown
Levels: 16-18 18 18-20 22 27 68 unkown
I can see how this could happen if the elements were of class character but effectively integer/numeric. But given more ambiguous formats such as 18-20, I am not sure how R knows how to order them.
In fact, if I try to convert to factor in two steps (first to integer, and then to factor), the first step already turns those elements into NA:
> as.integer(c("18", "68", "18-20", "22", "27", "16-18", "unkown"))
[1] 18 68 NA 22 27 NA NA
Warning message:
NAs introduced by coercion
This makes perfect sense, because 18-20 is just a character string.
Upvotes: 1
Views: 240
Reputation: 25435
For the case where no set of levels is supplied, the documentation of factor() states:
levels: an optional vector of the values (as character strings) that x might have taken. The default is the unique set of values taken by as.character(x), sorted into increasing order of x. Note that this set can be specified as smaller than sort(unique(x)).
So it has nothing to do with the numeric values: the levels are simply sorted as character strings. And indeed:
> sort(unique(as.character(c("18", "68", "18-20", "22", "27", "16-18", "unkown"))))
[1] "16-18" "18" "18-20" "22" "27" "68" "unkown"
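To make the difference between character and numeric sorting more visible, here is a small sketch. The vector v is made up for illustration and is not from the question; it is chosen so that the two orderings disagree.

```r
# Hypothetical values (not from the question) where the two sort orders differ
v <- c("9", "18", "100")

chr_sorted <- sort(v)              # character sort compares strings left to right
num_sorted <- sort(as.integer(v))  # numeric sort compares the values
default_levels <- levels(factor(v))

print(chr_sorted)      # "100" "18" "9"
print(num_sorted)      # 9 18 100
print(default_levels)  # same as the character sort
```

In the question's vector the character order happens to look numeric, which is probably why it seemed as if R was parsing the numbers.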
You can prevent the ordering as follows:
> x=c("18", "68", "18-20", "22", "27", "16-18", "unkown")
> factor(x,levels=unique(x))
[1] 18 68 18-20 22 27 16-18 unkown
Levels: 18 68 18-20 22 27 16-18 unkown
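If you would rather have the levels follow the numeric lower bound of each entry than the order of appearance, one possible approach is to build the levels vector yourself. This is only a sketch: the vector ages and the rule "leading digits are the lower bound, entries without digits go last" are assumptions for illustration, not something stated in the question.

```r
# Hypothetical data (not from the question)
ages <- c("18", "9", "18-20", "100", "unknown")

lv <- unique(ages)
# Leading run of digits, or "" when the entry does not start with a digit
lead <- regmatches(lv, regexpr("^[0-9]+", lv))
lower <- rep(NA_integer_, length(lv))
lower[grepl("^[0-9]+", lv)] <- as.integer(lead)

# Order by lower bound, break ties by the string itself; NA entries go last
ordered_levels <- lv[order(lower, lv)]
f <- factor(ages, levels = ordered_levels)
print(levels(f))  # "9" "18" "18-20" "100" "unknown"
```

The same levels argument is what the answer above uses with unique(x); only the way the vector of levels is built differs.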
Upvotes: 1