Bablu Joshi

Reputation: 389

What does “extended grapheme clusters are canonically equivalent” mean in terms of Swift String?

https://docs.swift.org/swift-book/LanguageGuide/StringsAndCharacters.html says:

Two String values (or two Character values) are considered equal if their extended grapheme clusters are canonically equivalent. Extended grapheme clusters are canonically equivalent if they have the same linguistic meaning and appearance, even if they’re composed from different Unicode scalars behind the scenes.

What is meant by “extended grapheme cluster”?

Upvotes: 0

Views: 787

Answers (1)

MANIAK_dobrii

Reputation: 6032

As mentioned in the document you're referencing:

An extended grapheme cluster is a sequence of one or more Unicode scalars that (when combined) produce a single human-readable character.

That is, an extended grapheme cluster is a single "visible character". In a text editor, the cursor usually moves over it in one step.

For example, ё and ё look identical, and each of them is an extended grapheme cluster, but the first is produced by a single Unicode scalar (or code point), while the second is produced by two:

  • ё = [ё CYRILLIC SMALL LETTER IO (U+0451)]
  • ё = [е CYRILLIC SMALL LETTER IE (U+0435), ◌̈ COMBINING DIAERESIS (U+0308)]

ё and ё are canonically equivalent, so Swift's == considers them equal even though they are produced from different sequences of Unicode code points. This is unlike, for example, -[NSString isEqualToString:], which compares the exact UTF-16 code units:

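import Foundation // needed for NSString

// Written with escapes so the two spellings cannot be confused or silently normalized
let e1 = "\u{451}" // ё as a single scalar: CYRILLIC SMALL LETTER IO
let e2 = "\u{435}\u{308}" // е (CYRILLIC SMALL LETTER IE) + COMBINING DIAERESIS

e1.unicodeScalars.count // 1
e2.unicodeScalars.count // 2
e1 == e2 // true, because Swift String uses canonical equivalence
(e1 as NSString).isEqual(to: e2) // false, because NSString compares UTF-16 code units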
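
To show where canonical equivalence stops, here is a minimal sketch that uses only the Swift standard library. The "café" and Latin/Cyrillic "A" cases follow the examples in the Swift book; the variable names are my own, and the scalar-by-scalar comparison is just one way to get a literal comparison without Foundation.

```swift
// Precomposed and decomposed spellings of "café".
let precomposed = "caf\u{E9}"        // é as U+00E9
let decomposed = "caf\u{65}\u{301}"  // e (U+0065) + COMBINING ACUTE ACCENT (U+0301)

// String equality uses canonical equivalence.
let equalAsStrings = precomposed == decomposed

// Comparing the underlying scalars one by one is a literal comparison.
let equalAsScalars = precomposed.unicodeScalars.elementsEqual(decomposed.unicodeScalars)

// Look-alikes that are not canonically equivalent stay different.
let latinA: Character = "\u{41}"     // LATIN CAPITAL LETTER A
let cyrillicA: Character = "\u{410}" // CYRILLIC CAPITAL LETTER A
let lookAlikesEqual = latinA == cyrillicA

print(equalAsStrings, equalAsScalars, lookAlikesEqual) // true false false
```

So "look the same" is not enough on its own: the two sequences must be canonically equivalent according to Unicode, which the Latin and Cyrillic "A" are not.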

To be precise, an "extended grapheme cluster" is the unit defined by one of the text segmentation algorithms in the Unicode standard (Unicode Standard Annex #29). Before it is encoded as UTF-8 or UTF-16, Unicode text is a sequence of Unicode code points. The segmentation algorithm analyzes this sequence and identifies the boundaries of the "visible characters", which are the extended grapheme clusters.
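
You can observe the result of this segmentation from Swift, because iterating a String yields Character values, one per extended grapheme cluster. The following is a rough sketch with a sample string I made up for illustration; it only uses standard library calls.

```swift
// "e" + COMBINING ACUTE ACCENT, then a pair of regional indicators (a flag), then "!"
let text = "e\u{301}\u{1F1EE}\u{1F1F3}!"

// Each Character is one extended grapheme cluster; list the scalars inside each one.
let clusters = text.map { character in
    character.unicodeScalars.map { String($0.value, radix: 16, uppercase: true) }
}

print(text.count)                // 3 extended grapheme clusters
print(text.unicodeScalars.count) // 5 Unicode scalars
for cluster in clusters {
    print(cluster)               // hex values of the scalars in this cluster
}
```

The five scalars are grouped into three clusters: the accented letter, the flag, and the exclamation mark.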

If you're interested in details, a good place to start is Glossary of Unicode Terms.

Upvotes: 5
