Valdir Stumm Junior

Reputation: 4667

String ordering in Lua

I'm reading Programming in Lua, 1st edition (yup, I know it's a bit outdated), and in section 3.2 (on relational operators) the author says:

For instance, with the European Latin-1 locale, we have "acai" < "açaí" < "acorde".

I don't get it. For me, it's OK to have "acai" < "açaí", but why is "açaí" < "acorde"?

AFAIK (and Wikipedia seems to confirm), "c" < "ç", or am I wrong?

Upvotes: 9

Views: 1405

Answers (2)

Tom Blodget

Reputation: 20772

You reference a code page, which maps codepoints to characters. Certainly codepoints, being a finite set of non-negative integers, are well-ordered, distinct entities. However, that is not what characters are about.
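To make the codepoint view concrete, here is a minimal sketch. It assumes Lua's default "C" locale, where string comparison is byte-wise, and Latin-1 strings written with decimal escapes so that the encoding of the source file does not matter. The variable names are only illustrative.

```lua
-- Latin-1 bytes: "ç" is 231 (0xE7), "í" is 237 (0xED).
local acai_plain  = "acai"
local acai_accent = "a\231a\237"   -- "açaí" encoded in Latin-1
local acorde      = "acorde"

-- Codepoints are plain integers, so they are totally ordered.
print(string.byte("c"), string.byte("\231"), string.byte("o"))  -- 99, 231, 111

-- In the "C" locale, < compares the strings byte by byte.
print(acai_plain < acai_accent)   -- true:  "c" (99) < "ç" (231)
print(acai_accent < acorde)       -- false: "ç" (231) > "o" (111)
```

Under pure codepoint order, "açaí" therefore sorts after "acorde", which is exactly the result the question expected.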

Characters have a collation order, which is not a strict total ordering: distinct characters can compare as "equal" without being the same. Collation is a user-valued concept that varies by locale (and over time).

Strings are even more complicated because some character sets (e.g. Unicode) have combining characters. These allow a "character" to be represented either as a single precomposed character or as a base character followed by combining characters. For example, "ä" can be the single character U+00E4 or "a" followed by the combining diaeresis U+0308. Since both represent the same conceptual character, they should be considered even more equal than "ä" vs "a".
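A small sketch of the two representations, assuming UTF-8 encoded strings written with decimal escapes (the variable names are made up for illustration). Standard Lua has no Unicode normalization function, so the two forms stay distinct unless an external library normalizes them.

```lua
-- Both strings render as "ä" but are different byte sequences in UTF-8.
local precomposed = "\195\164"    -- U+00E4 LATIN SMALL LETTER A WITH DIAERESIS
local decomposed  = "a\204\136"   -- "a" followed by U+0308 COMBINING DIAERESIS

print(#precomposed, #decomposed)  -- 2 and 3 bytes respectively
print(precomposed == decomposed)  -- false: string equality in Lua compares raw bytes
```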

In Spanish, "ch", "rr" and "ll" used to be letters in the alphabet and words were ordered accordingly; they no longer are, but "ñ" still is.

Similarly, in the past it was not uncommon for English-speakers to sort surnames beginning with "Mc" and "Mac" after others beginning with "M".

Software libraries have to deal with such things because that's what users want. Thankfully, some of the older conventions have fallen from use.


So a locale could very well have collation rules that result in "acai" < "açaí" < "acorde": that happens if "c" has the same sort order as "ç" but "i" comes before "í". This particular case seems strange, but our code has to allow for such possibilities in general.
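The rule described above can be sketched as a two-level comparison. This is a hypothetical illustration, not how any real locale is implemented: `BASE`, `primary_key` and `collate_less` are invented names, the strings are assumed to be Latin-1, and the tie-break relies on Lua running in its default "C" locale.

```lua
-- Hypothetical two-level collation for Latin-1 strings (illustration only).
-- Level 1 ignores accents; level 2 breaks ties by raw byte order.
local BASE = { ["\231"] = "c", ["\237"] = "i" }   -- "ç" -> "c", "í" -> "i"

local function primary_key(s)
  -- Replace each non-ASCII byte by its base letter when one is listed;
  -- bytes without an entry in BASE are kept unchanged.
  return (s:gsub("[\128-\255]", BASE))
end

local function collate_less(a, b)
  local pa, pb = primary_key(a), primary_key(b)
  if pa ~= pb then
    return pa < pb   -- base letters differ, so they decide the order
  end
  return a < b       -- same base letters: accented bytes (>= 128) sort last
end

local acai, acai_accent, acorde = "acai", "a\231a\237", "acorde"
print(collate_less(acai, acai_accent))    -- true: tie on "acai", accents break it
print(collate_less(acai_accent, acorde))  -- true: "acai" < "acorde" on base letters
```

With these assumed rules, "ç" never gets compared against "o" directly: the primary keys "acai" and "acorde" already differ at the third letter, where "a" comes before "o".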

Upvotes: 5

Yu Hao

Reputation: 122383

In the third edition of PiL, this statement has been modified:

For instance, with a Portuguese Latin-1 locale, we have "acai"<"açaí"<"acorde".

So the locale needs to be set to Portuguese Latin-1 accordingly:

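-- Default "C" locale: strings compare byte by byte
print("acai" < "açaí")
print("açaí" < "acorde")

-- Switch locale; prints the new locale name, or nil if it is unavailable
print(os.setlocale("pt_PT"))

print("acai" < "açaí")
print("açaí" < "acorde")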

On ideone, the result is:

true
false
pt_PT.iso88591
false
true

Note, however, that under this locale the order of "acai" and "açaí" is now the opposite of what the book states.
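When repeating this experiment, it may help to change only the collation category and to check whether the locale is installed at all, since `os.setlocale` returns nil when the request cannot be honored. The following is a rough sketch: `with_collation` is a hypothetical helper, the test string assumes Latin-1 bytes, and the result depends entirely on the locales available on the machine.

```lua
-- Hypothetical helper: run f under the given collation locale, then restore.
local function with_collation(locale, f)
  local previous = os.setlocale(nil, "collate")   -- query the current setting
  if not os.setlocale(locale, "collate") then
    return nil, "locale not available: " .. locale
  end
  local result = f()
  os.setlocale(previous, "collate")               -- put the old setting back
  return result
end

-- "a\231a\237" is "açaí" in Latin-1; the outcome is system-dependent.
local result, err = with_collation("pt_PT", function()
  return "a\231a\237" < "acorde"
end)
print(result, err)
```

A nil result with an error message means the locale is missing, while true or false is the actual comparison under that locale.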

Upvotes: 8
