Reputation: 43148
In the Swift documentation for comparing strings, I found the following:
Two String values (or two Character values) are considered equal if their extended grapheme clusters are canonically equivalent. Extended grapheme clusters are canonically equivalent if they have the same linguistic meaning and appearance, even if they are composed from different Unicode scalars behind the scenes.
The documentation then gives the following example, which shows two strings that are "canonically equivalent":
For example, LATIN SMALL LETTER E WITH ACUTE (U+00E9) is canonically equivalent to LATIN SMALL LETTER E (U+0065) followed by COMBINING ACUTE ACCENT (U+0301). Both of these extended grapheme clusters are valid ways to represent the character é, and so they are considered to be canonically equivalent:
OK. Somehow e and é look the same and also have the same linguistic meaning. Sure, I'll give them that. I took a Spanish class once, and the professor wasn't too strict about which form of e we used, so I'm guessing this is what they are referring to. Fair enough.
The documentation goes further to show two strings that are not canonically equivalent:
Conversely, LATIN CAPITAL LETTER A (U+0041, or "A"), as used in English, is not equivalent to CYRILLIC CAPITAL LETTER A (U+0410, or "А"), as used in Russian. The characters are visually similar, but do not have the same linguistic meaning:
Now here is where the alarm bells go off and I decide to ask this question. It seems that appearance has nothing to do with it, because the two strings look exactly the same, and the documentation admits as much. So it seems that what the String type is really looking for is linguistic meaning?
This is why I ask what it means for strings to have the same or different linguistic meaning. e is the only form of e that I know of that is mainly used in English, and I have only seen é used in languages like French or Spanish. So why is the fact that А is used in Russian and A is used in English what causes the String type to say that they are not equivalent?
I hope I was able to walk you through my thought process. My question is: what does it mean for two strings to have the same linguistic meaning (in code, if possible)?
Upvotes: 0
Views: 211
Reputation: 385690
You said:
Somehow e and é look the same and also have the same linguistic meaning.
No. You have misread the document. Here's the document again:
LATIN SMALL LETTER E WITH ACUTE (U+00E9) is canonically equivalent to LATIN SMALL LETTER E (U+0065) followed by COMBINING ACUTE ACCENT (U+0301).
Here's U+00E9: é
Here's U+0065: e
Here's U+0301: ´
Here's U+0065 followed by U+0301: é
So U+00E9 (é) looks and means the same as U+0065 U+0301 (é). Therefore they must be treated as equal.
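Since you asked for code, here is a minimal sketch of that comparison. The constant names precomposed and decomposed are mine, chosen only for illustration; the scalar values are the ones from the documentation.

```swift
// Two ways to spell é. The names are illustrative, not from the docs.
let precomposed = "\u{E9}"        // LATIN SMALL LETTER E WITH ACUTE
let decomposed = "\u{65}\u{301}"  // LATIN SMALL LETTER E + COMBINING ACUTE ACCENT

// String equality compares by canonical equivalence, not by raw scalars.
print(precomposed == decomposed)            // true

// Each string is a single Character (one extended grapheme cluster)...
print(precomposed.count, decomposed.count)  // 1 1

// ...but the underlying Unicode scalars differ.
print(precomposed.unicodeScalars.count)     // 1
print(decomposed.unicodeScalars.count)      // 2
print(precomposed.unicodeScalars.elementsEqual(decomposed.unicodeScalars))  // false
```

So the two strings are different scalar sequences that encode the same character, which is why == reports them as equal.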
So why is Cyrillic А different from Latin A? UTN #26 gives several reasons. Here are some:
“Traditional graphology has always treated them as distinct scripts, …”
“Literate users of Latin, Greek, and Cyrillic alphabets do not have cultural conventions of treating each other's alphabets and letters as part of their own writing systems.”
“Even more significantly, from the point of view of the problem of character encoding for digital textual representation in information technology, the preexisting identification of Latin, Greek, and Cyrillic as distinct scripts was carried over into character encoding, from the very earliest instances of such encodings.”
“[A] unified encoding of Latin, Greek, and Cyrillic would make casing operations an unholy mess, …”
Read the tech note for full details.
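As a rough illustration of the casing point (the constant names latinA and cyrillicA are mine, not from the documentation or the tech note), the two capital letters are separate code points with separate lowercase forms:

```swift
// Visually similar capitals from two different scripts.
let latinA = "\u{41}"      // LATIN CAPITAL LETTER A
let cyrillicA = "\u{410}"  // CYRILLIC CAPITAL LETTER A

// They are not canonically equivalent, so they do not compare equal.
print(latinA == cyrillicA)                 // false

// Each lowercases within its own script.
print(latinA.lowercased() == "\u{61}")     // true: LATIN SMALL LETTER A
print(cyrillicA.lowercased() == "\u{430}") // true: CYRILLIC SMALL LETTER A
print(latinA.lowercased() == cyrillicA.lowercased())  // false
```

This only shows that the standard library keeps the two letters apart; the reasons for encoding them separately are the ones quoted above.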
Upvotes: 3