J.Doe

Reputation: 464

Comparing hex number ranges for unicode encodings gives unexpected result

I'm using the UTF-8 encodings of Japanese characters to distinguish hiragana from kanji in my script. The Unicode table for the relevant characters can be found here. As the table shows, the kanji I want to identify are in the last table, called "CJK unified ideographs - Common and uncommon kanji", and fall in the hexadecimal range 4e00 - 9faf. The unwanted hiragana and katakana characters have lower numbers (3040 - 309f and 30a0 - 30ff).

The idea is simple: encode the first and last characters of the kanji table as UTF-8 so they can serve as the bounds of a comparison. Then iterate over every character in a given string of Japanese characters, encode it as UTF-8, and check whether it falls within the kanji interval.

def find_kanji(chars):
    
    min_char = '一'.encode('utf8')    
    max_char = '䶿'.encode('utf8')
    
    found = []
    for char in chars:
        if min_char <= char.encode('utf8') <= max_char:
            found.append(char)

    return found

However, this doesn't work if I specify the full interval. It only works if the comparison is shortened to

if min_char <= char.encode('utf8'):

I don't know how the hexadecimal encoding of strings works, so I can't tell exactly what the problem is. For example, when I encode my characters, I get byte strings such as

あ: b'\xe3\x81\x82'
お: b'\xe3\x81\x8a'
青: b'\xe9\x9d\x92'
い: b'\xe3\x81\x84'

and I can't link an encoding such as b'\xe9\x9d\x92' to a hex number in the linked table. The character 青 has the encoding b'\xe9\x9d\x92', yet the table lists it under the hex number 9752, and I don't understand how to translate one into the other. That's why I can't determine why my comparison fails.

Example code:

def find_kanji(chars):
    
    min_char = '一'.encode('utf8')    
    max_char = '䶿'.encode('utf8')
    
    found = []
    for char in chars:
        if min_char <= char.encode('utf8') <= max_char:  # remove the comparison with max_char to make it work
            found.append(char)

    return found

print(find_kanji('吸い込む'))

Upvotes: 0

Views: 35

Answers (1)

gog

Reputation: 11347

You don't need the encoding step at all. Python compares strings by Unicode code point, so you can compare the characters directly against the hex numbers from the table:

def find_kanji(chars):
    min_char = '\u4e00'
    max_char = '\u9faf'

    return [
        c for c in chars
        if min_char <= c <= max_char
    ]
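
To connect the bytes from the question to the hex numbers in the table, here is a minimal sketch. It assumes Python 3 and only handles three-byte UTF-8 sequences (code points U+0800 to U+FFFF, which covers the ranges discussed here). The helper names `code_point_from_utf8` and `describe` are made up for illustration. `ord()` gives the table number directly, and the manual decoding shows where the bits of b'\xe9\x9d\x92' end up.

```python
def code_point_from_utf8(data):
    """Decode a single 3-byte UTF-8 sequence by hand (illustrative helper)."""
    if len(data) != 3 or (data[0] & 0xF0) != 0xE0:
        raise ValueError("expected a 3-byte UTF-8 sequence")
    b0, b1, b2 = data
    # Layout: 1110xxxx 10yyyyyy 10zzzzzz -> code point bits xxxx yyyyyy zzzzzz
    return ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)


def describe(char):
    """Show a character with its code point and its UTF-8 bytes."""
    return f"{char}: U+{ord(char):04X} utf8={char.encode('utf8')!r}"


for ch in '一青あ':
    print(describe(ch))

# 青 is listed as 9752 in the table; both routes agree.
print(hex(ord('青')))                                   # 0x9752
print(hex(code_point_from_utf8(b'\xe9\x9d\x92')))       # 0x9752

# Quick sanity check for any bound used in the comparison:
print(hex(ord('䶿')))
```

It may be worth running the last line on the upper bound from the question. If that code point turns out to be below 4e00, then min_char is greater than max_char, the interval is empty, and that would explain why the check only passes once the comparison with max_char is removed. Byte-wise comparison of UTF-8 strings preserves code point order, so the encoded comparison is not wrong in itself as long as the bounds are in the right order.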

Alternatively, consider a regular expression with a character range, or the third-party regex module, which lets you match Unicode blocks directly.
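
As a rough sketch of the character-range variant, using only the standard library `re` module and the same 4e00 - 9faf bounds as above (the names `KANJI_RE` and `find_kanji_re` are made up for this example):

```python
import re

# Character class covering the same interval as the comparison above.
KANJI_RE = re.compile(r'[\u4e00-\u9faf]')


def find_kanji_re(chars):
    """Return every character of `chars` that falls in the kanji range."""
    return KANJI_RE.findall(chars)


print(find_kanji_re('吸い込む'))  # ['吸', '込']
```

The hiragana い and む fall outside the range and are skipped.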

Upvotes: 1
