daanoo

Reputation: 781

arrange() putting capital letters first

I usually use the arrange() function from dplyr to sort datasets, but it behaved in a way I couldn't understand, and it took me a little while to get to the bottom of it. I've fixed my code by using order() instead, but now I'm curious. I have used arrange() without thinking twice for ages, and I wonder why this is the default behavior. It seems to fail to sort alphabetically when capital letters are involved: it treats capital letters as coming before lowercase letters, even when the lowercase letters precede them in the alphabet. Am I missing something?

This is not always a problem, but it became one for me when I called tapply() immediately after arrange(), assuming the data would be sorted in the same order that tapply() uses for its groups. Here's an example where arrange() puts "USSR" before "Uganda" and "Ukraine", whereas order() (correctly, I think!) puts it last.

library(dplyr)
countries <- c("USSR", "Uganda", "Ukraine")
tmp <- data.frame(countries, stringsAsFactors = FALSE)
tmp %>% arrange(countries)   # orders it one way: USSR first
tmp[order(tmp$countries), ]  # orders it another way: USSR last
sort(tmp$countries)          # sort() agrees with order()

I looked around to see whether others had run into the same problem and couldn't find anything. Forgive me if this has been discussed previously.

Upvotes: 6

Views: 1752

Answers (2)

Unai Vicente

Reputation: 379

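According to the documentation, arrange() takes a .locale argument that controls how character vectors are sorted. When it is not set, dplyr (1.1.0 and later) sorts in the C locale, which puts capital letters first. In this case, the issue is resolved by adding .locale = "en":

tmp %>% arrange(countries, .locale = "en")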
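
To make the difference concrete, here is a minimal sketch comparing the two calls on the data from the question. It assumes dplyr 1.1.0 or later (where the .locale argument exists) and that the stringi package is installed, which dplyr needs for any locale other than "C". The names c_sorted and en_sorted are only illustrative.

```r
library(dplyr)

countries <- c("USSR", "Uganda", "Ukraine")
tmp <- data.frame(countries, stringsAsFactors = FALSE)

# No .locale given: character columns are sorted in the C locale,
# so "S" (uppercase) sorts before "g" and "k" (lowercase).
c_sorted <- tmp %>% arrange(countries)

# Explicit English locale: comparison follows dictionary order,
# so "USSR" lands after "Uganda" and "Ukraine". Requires stringi.
en_sorted <- tmp %>% arrange(countries, .locale = "en")

print(c_sorted$countries)
print(en_sorted$countries)
```

If every arrange() call in a script should behave this way, the documentation also describes a dplyr.legacy_locale option that falls back to the session locale, though passing .locale explicitly keeps the result independent of the machine the code runs on.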

Upvotes: 3

atiretoo

Reputation: 1902

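Yes, the comment from @MrFlick is correct. The difference comes from the collation locale: the C locale compares characters by their byte values, so every uppercase letter sorts before every lowercase one. If I do

Sys.setlocale("LC_COLLATE", "C")

then

tmp[order(tmp$countries), ]

matches the result from arrange().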
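
As a rough sketch of the same check in base R only, the snippet below switches the collation locale temporarily and then restores it, so the rest of the session is unaffected. It also shows order() with method = "radix", which sorts character vectors in the C locale regardless of the session setting. The variable names are illustrative, and restoring the locale assumes the saved value is accepted by Sys.setlocale() on your platform.

```r
countries <- c("USSR", "Uganda", "Ukraine")

# Remember the current collation locale so it can be restored afterwards.
old_collate <- Sys.getlocale("LC_COLLATE")

# Sort under C collation: comparison is by byte value,
# so uppercase "S" comes before lowercase "g" and "k".
Sys.setlocale("LC_COLLATE", "C")
c_order <- countries[order(countries)]

# Put the session back the way it was.
Sys.setlocale("LC_COLLATE", old_collate)

# Radix ordering uses C-locale comparison for character vectors
# without changing any session setting.
radix_order <- countries[order(countries, method = "radix")]

print(c_order)
print(radix_order)
```

Both results put "USSR" first, which is the ordering arrange() produced in the question.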

Upvotes: 3
