cuongbn

Reputation: 1

Unicode strings that look the same but are actually different

Error text unicode

When I use cURL to fetch the content of a website, the text I get looks the same as what I expect but is actually different underneath. This breaks the processing and comparison of documents. Is there a way to convert $loi to the standard form used by $check so that I can handle it correctly?

You can copy the content of $loi or $check into a cmd window to immediately see the difference, as shown in the picture:

$loi = 'người được tiêm';
    $check = 'người được tiêm';
    var_dump($loi);
    var_dump($check);

Upvotes: 0

Views: 54

Answers (1)

paxdiablo

Reputation: 881463

There are codepoints that look the same in Unicode in the sense that they have similar or even identical graphemes, but are not actually the same.

That should come as little surprise when you realise the intent of Unicode is to hold as many languages as possible, and it generally indicates the purpose of a letter, rather than the form.

For example, U+2010 and U+2011 (hyphen and non-breaking hyphen) are likely to look exactly the same, since the latter is simply the non-breaking version of the former.

If you pump your two strings into the Unicode to code points converter, you'll see the difference.

For brevity, I've only done the first word of each, and given the codepoints in hex with square brackets around each "character":

người [6e] [67] [75 31b] [6f 31b 300] [69]
người [6e] [67] [1b0]    [1edd]       [69]

For example, the ư in the first one is 75 31b, which is Latin small letter U followed by combining horn (a modifier to the letter). In the second, it's the single 1b0, Latin small letter U with horn (modifier built in to the codepoint already).

Similarly, ờ is 6f 31b 300 in the first, three separate codepoints representing the Latin small letter O, a combining horn modifier, and a combining grave accent modifier. The second has this as 1edd with both modifiers already incorporated into the single codepoint, Latin small letter O with horn and grave.

So, in these cases, it's not actually a different intent to the grapheme, but it is a different way of representing it, either:

  • a single codepoint with modifiers built in; or
  • a codepoint with separate additional modifier codepoints.
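
The codepoint breakdown above can also be reproduced locally rather than through an online converter. The following is only a minimal sketch: the helper name `describe` is made up for illustration, and the two strings are written with escape sequences (an assumption about what the question's strings contain, based on the codepoints listed above) so that an editor cannot silently convert one form into the other.

```python
import unicodedata

def describe(text):
    """Return a (hex codepoint, Unicode name) pair for each codepoint in text."""
    return [(f"{ord(ch):x}", unicodedata.name(ch, "<unnamed>")) for ch in text]

# Hypothetical stand-ins for the first word of $loi and $check.
decomposed = "ngu\u031bo\u031b\u0300i"   # base letters plus combining marks
composed = "ng\u01b0\u1eddi"             # precomposed letters

for label, word in (("decomposed", decomposed), ("composed", composed)):
    print(label, "has", len(word), "codepoints")
    for code, name in describe(word):
        print(f"  {code:>4}  {name}")
```

The decomposed form reports eight codepoints and the composed form five, even though both render as the same word.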

If you need to treat them the same, Unicode has a concept of equivalence and normalisation.

Equivalence indicates multiple codepoint sequences that are effectively variants of the same "thing", and normalisation is the process of mapping equivalents to a single variant so that comparison is easier.

In Python, I would use the following to map one way or the other:

import unicodedata
normalised_composed = unicodedata.normalize('NFC', 'người')
normalised_decomposed = unicodedata.normalize('NFD', 'người')
# Composed is the short sequence (minimal codepoints), decomposed is the long one.

The following transcript shows the outputs, though I've reformatted and annotated for readability:

>>> bytearray('người', 'utf-16')
bytearray(b'\xff\xfe                # Unicode BOM for UTF-16.
    n\x00                           # n.
    g\x00                           # g.
    u\x00 \x1b\x03                  # u, combining horn.
    o\x00 \x1b\x03 \x00\x03         # o, combining horn & grave.
    i\x00                           # i.
')

>>> bytearray(unicodedata.normalize('NFD', 'người'), 'utf-16')
bytearray(b'\xff\xfe                # Identical to previous, it
    n\x00                           #   was already decomposed.
    g\x00
    u\x00 \x1b\x03
    o\x00 \x1b\x03 \x00\x03
    i\x00
')

>>> bytearray(unicodedata.normalize('NFC', 'người'), 'utf-16')
bytearray(b'\xff\xfe                # BOM.
    n\x00                           # n.
    g\x00                           # g.
    \xb0\x01                        # Latin u with horn.
    \xdd\x1e                        # Latin o with horn & grave.
    i\x00                           # i.
')
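
For the comparison problem in the question, one approach is to normalise both strings to the same form before comparing them. This is a sketch under assumptions, not a definitive implementation: `same_text` is a hypothetical helper name, and `loi` and `check` stand in for the first word of the question's two variables, written with escapes so the difference survives copy and paste.

```python
import unicodedata

def same_text(a, b, form="NFC"):
    """Compare two strings after normalising both to the same Unicode form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

loi = "ngu\u031bo\u031b\u0300i"    # decomposed: combining horn and grave
check = "ng\u01b0\u1eddi"          # composed: single precomposed codepoints

print(loi == check)                # False: the codepoint sequences differ
print(same_text(loi, check))       # True: equal once both are in NFC
print(same_text(loi, check, "NFD"))  # True: equal in NFD as well
```

Either form works for comparison as long as both sides use the same one. If the language turns out to be PHP, the `Normalizer::normalize` method from the intl extension should play the same role, assuming that extension is installed.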

I'm not entirely certain what language you're using (there's currently no tag) but, if it claims to handle Unicode, it should hopefully have equivalent functions to do this (hence I still consider this answer useful if you add the tag later).

Simply search for <your_language> unicode normalisation in your search engine of choice.

Upvotes: 2
