gavenkoa

Reputation: 48883

What is alphabetical ordering for sort utility?

I call myself a POSIX shell wizard, but today I was thoroughly humbled.

Nothing strange here:

bash# printf 'v10\nv1.' | sort
v1.
v10

because . has code 0x2e and 0 has code 0x30. But what about this:

bash# printf 'v101\nv1.1' | sort
v101
v1.1

WTF? OK, I am a wizard, so let's check the locale:

$ locale

LANG=en_US.UTF-8
LANGUAGE=
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME=en_DK.utf8
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

So:

bash# printf 'v101\nv1.1' | LC_ALL=C sort
v1.1
v101

How can locales / collation make "v101" < "v1.1"?

I think the en_US.UTF-8 locale has a collation rule that ignores the . character. This test suggests I have a point:

bash# printf 'v102\nv1.01' | LC_ALL=en_US.UTF-8 sort
v1.01
v102

bash# printf 'v102\nv1.03' | LC_ALL=en_US.UTF-8 sort
v102
v1.03

Am I right? And if I am, who is it that doesn't like dots: UTF-8, English speakers, or Americans?

Is this POSIX-compliant behavior?

Upvotes: 3

Views: 160

Answers (1)

hookenz

Reputation: 38947

Yes, it appears that the dot is ignored when the collation locale is not C. The dash is ignored the same way. And sort obeys the locale. You learn something new every day.

matt@xen:~/dev/OTOY2$ printf "aa\nab\nac\n" | LC_COLLATE=C sort
aa
ab
ac
matth@xen:~/dev/OTOY2$ printf "aa\n.ab\nac\n" | LC_COLLATE=C sort
.ab
aa
ac

matt@xen:~/dev/OTOY2$ printf "aa\nab\nac\n" | sort
aa
ab
ac
matth@xen:~/dev/OTOY2$ printf "aa\n.ab\nac\n" | sort
aa
.ab
ac
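
One caveat about the commands above: LC_COLLATE=C only takes effect when LC_ALL is unset or empty (as it is in the question's locale output), because a non-empty LC_ALL overrides every other LC_* variable. A minimal sketch for scripts that need byte-wise ordering no matter how the user's environment is set up (the variable name sorted_bytewise is just for illustration):

```shell
#!/bin/sh
# LC_ALL overrides LC_COLLATE and LANG, so setting it for this one
# command forces plain byte-value comparison.
sorted_bytewise=$(printf 'aa\n.ab\nac\n' | LC_ALL=C sort)

# '.' is 0x2e and 'a' is 0x61, so ".ab" sorts first.
printf '%s\n' "$sorted_bytewise"
```

Setting the variable on the command line like this affects only that one sort invocation, not the rest of the shell session.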

You may be interested to know that sort can do numeric sorting too. So 100, 10 and 20 can be sorted correctly using -n or -g (or -h for human-readable sizes such as 2K and 1G). GNU sort also has -V for natural, version-style ordering.
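
As a rough illustration of those options, here is a small sketch. It assumes a sort that supports -V (a GNU extension, not POSIX); -n is standard. The variable names are only for the example:

```shell
#!/bin/sh
# -n compares leading numbers by value instead of character by character.
numeric_sorted=$(printf '100\n10\n20\n' | LC_ALL=C sort -n)
printf '%s\n' "$numeric_sorted"

# -V (GNU extension) compares runs of digits as numbers, which gives
# the ordering most people expect for version-like strings.
version_sorted=$(printf 'v101\nv1.1\nv10\n' | LC_ALL=C sort -V)
printf '%s\n' "$version_sorted"
```

With GNU sort the first command should print 10, 20, 100 and the second v1.1, v10, v101, which is probably the order the question was hoping for.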

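GNU sort (the version found on Linux) also has a --debug flag, which reports the sorting rules in use and underlines the part of each line used as the key: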

matthewh@xen:~/dev/OTOY2$ printf 'v101\nv1.1' | sort --debug
sort: using ‘en_NZ.UTF-8’ sorting rules
v101
____
v1.1
____

I think the whole answer is buried in this massive spec, the Unicode Collation Algorithm: http://www.unicode.org/reports/tr10/

Upvotes: 2
