Nicholas Saunders

Reputation: 764

How to iterate over UTF-8 in Python?

How do I iterate over UTF-8?

import string

for character in string.printable[1:]:
    print(character)

Presumably there's a similar approach for UTF-8?

Upvotes: 2

Views: 1178

Answers (1)

ninMonkey

Reputation: 7501

Presumably there's a similar approach for UTF-8?

Do you want to know which code points outside the ASCII range are printable? Or do you want the UTF-8 encodings of printable characters?

To get every printable code point in Unicode:

unicode_max = 0x10ffff  # highest Unicode code point (same value as sys.maxunicode)
printable_glyphs = [chr(x) for x in range(unicode_max + 1) if chr(x).isprintable()]
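If you only need part of the range, or want to avoid building a list of over a hundred thousand characters, a generator is one option. This is a minimal sketch rather than a definitive approach: the helper name iter_printable and the choice of the Greek and Coptic block (U+0370 to U+03FF) are only for illustration.

```python
import sys


def iter_printable(start=0, stop=sys.maxunicode + 1):
    """Yield each printable code point in [start, stop) as a one-character str."""
    for codepoint in range(start, stop):
        character = chr(codepoint)
        if character.isprintable():
            yield character


# Illustration: printable characters in the Greek and Coptic block.
greek = list(iter_printable(0x0370, 0x0400))

# Show the first few with their code point and UTF-8 bytes.
for character in greek[:5]:
    print(f"U+{ord(character):04X} {character} {character.encode('utf-8')}")
```

Each yielded item is still a str. The UTF-8 bytes only appear once you call encode on it.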

UTF-8 is an encoding: a rule for mapping text to specific bytes, so that it can be stored in files or shared with other programs.

Text in memory is not UTF-8. A Python str is a sequence of Unicode code points, and each code point is identified by a single integer.
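A small sketch of the difference, using a sample string chosen only for illustration: iterating a str walks its code points, while iterating the encoded bytes walks the individual UTF-8 bytes.

```python
# Sample text: ASCII 'a', U+00E9 (e with acute accent), U+1F412 (monkey).
text = "a\u00e9\U0001F412"

# Iterating a str yields one-character strings, one per code point.
codepoints = [f"U+{ord(ch):04X}" for ch in text]

# Iterating the encoded bytes yields integers, one per UTF-8 byte.
utf8_bytes = list(text.encode("utf-8"))

print(len(text), codepoints)
print(len(utf8_bytes), utf8_bytes)
```

The string has three code points but encodes to seven bytes, because UTF-8 uses between one and four bytes per code point.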

Converting to UTF-8

import unicodedata
monkey = unicodedata.lookup('monkey')

print(f"""
    glyph: {monkey}
    codepoint: Dec: {ord(monkey)}
    codepoint: Hex:  {hex(ord(monkey))}

    utf8: { monkey.encode('utf8', errors='strict') }
    utf16: { monkey.encode('utf16', errors='strict') }
    utf32: { monkey.encode('utf32', errors='strict') }
""")

Output:

glyph: 🐒
codepoint: Dec: 128018
codepoint: Hex:  0x1f412

 utf8: b'\xf0\x9f\x90\x92'
utf16: b'\xff\xfe=\xd8\x12\xdc'
utf32: b'\xff\xfe\x00\x00\x12\xf4\x01\x00'
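The leading b'\xff\xfe' in the utf16 and utf32 results above is a byte order mark (BOM), which suggests the output came from a little-endian machine. The '=' is just byte 0x3d shown as its ASCII character. The sketch below, which is an illustration and not part of the original output, shows how naming the byte order explicitly drops the BOM, and that decoding reverses the conversion.

```python
import codecs

monkey = "\N{MONKEY}"  # same character as above, U+1F412

with_bom = monkey.encode("utf-16")          # BOM, then code units in the platform's byte order
little_endian = monkey.encode("utf-16-le")  # explicit byte order, no BOM
big_endian = monkey.encode("utf-16-be")     # explicit byte order, no BOM

# The first two bytes of the plain "utf-16" result are one of the two BOMs.
print(with_bom[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE))
print(little_endian.hex())
print(big_endian.hex())

# Decoding turns the bytes back into the original text.
print(with_bom.decode("utf-16") == monkey)
print(b"\xf0\x9f\x90\x92".decode("utf-8") == monkey)
```

UTF-8 has no byte-order variants, which is one reason it is the usual choice for files and network data.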

Upvotes: 1
