Reputation: 781
I usually use the arrange() function from dplyr to sort datasets, but it behaved in a way that I couldn't understand, and it took me a little while to get to the bottom of it. I've fixed my code by using order() to do the same thing, but now I'm curious. I have used arrange() without thinking twice for ages, and I wonder why this is the default behavior. It seems to fail to sort alphabetically when capital letters are involved: it treats capital letters as coming before lowercase letters, even when the lowercase letters come earlier in the alphabet. Am I missing something?
This is not always a problem, but it became one for me when I called tapply() immediately after sorting with arrange(), assuming that the data would be in the same order that tapply() uses for its output. Here's an example in which arrange() puts "USSR" before "Uganda" and "Ukraine", whereas order() (correctly, I think!) puts it last.
library(dplyr)
countries <- c("USSR", "Uganda", "Ukraine")
tmp <- data.frame(countries, stringsAsFactors = FALSE)
tmp %>% arrange(countries)   # puts "USSR" first
tmp[order(tmp$countries), ]  # puts "USSR" last
sort(tmp$countries)          # sort() agrees with order()
I looked around to see whether others had encountered this same problem, and couldn't see anything. Forgive me if this has been discussed previously.
Upvotes: 6
Views: 1752
Reputation: 379
According to the documentation, arrange() takes a .locale argument that sets the locale used for sorting. In this case, adding .locale = "en" resolves the issue:
tmp %>% arrange(countries, .locale = "en")
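To make the difference concrete, here is a minimal sketch that sorts the same column under both locales. It assumes dplyr 1.1.0 or later (where the .locale argument is available) and, for the non-"C" locale, that the stringi package is installed; the names by_c and by_en are just illustrative.

```r
library(dplyr)

countries <- c("USSR", "Uganda", "Ukraine")
tmp <- data.frame(countries, stringsAsFactors = FALSE)

# Explicit "C" locale: strings are compared by code point, so uppercase "S"
# sorts before lowercase "g" and "USSR" comes first.
by_c <- tmp %>% arrange(countries, .locale = "C")
print(by_c$countries)

# An "en" locale uses dictionary-style collation. dplyr relies on stringi
# for non-"C" locales, so this part is guarded (an assumption about your setup).
if (requireNamespace("stringi", quietly = TRUE)) {
  by_en <- tmp %>% arrange(countries, .locale = "en")
  print(by_en$countries)
}
```

If stringi is available, the second result should list "USSR" last, which is the order the question expected.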
Upvotes: 3
Reputation: 1902
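Yes, the comment from @MrFlick is correct. If I run
Sys.setlocale("LC_COLLATE", "C")
then
tmp[order(tmp$countries), ]
matches the result from arrange(). In the "C" locale, strings are compared by character code, so every uppercase letter sorts before every lowercase one.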
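The same comparison can be reproduced in base R without changing the session's collation setting. This is only a sketch: it relies on the documented behavior that method = "radix" sorts character vectors in C-locale order, and the tolower() variant is a different technique (a simple case-insensitive sort), not what arrange() does.

```r
countries <- c("USSR", "Uganda", "Ukraine")

# Character codes explain the C-locale result: "S" has a lower code than "g".
code_S <- utf8ToInt("S")
code_g <- utf8ToInt("g")
print(c(S = code_S, g = code_g))

# method = "radix" always uses C-locale ordering for character vectors,
# regardless of the LC_COLLATE setting of the session.
c_sorted <- sort(countries, method = "radix")
print(c_sorted)

# Case-insensitive alternative: order by the lowercased strings instead.
ci_sorted <- countries[order(tolower(countries), method = "radix")]
print(ci_sorted)
```

Sorting on tolower() is enough for plain ASCII names like these; for accented or non-Latin text, a proper locale-aware collation is the safer choice.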
Upvotes: 3