How to iterate over UTF-8 in Python?

Question

How do I iterate over UTF-8? For ASCII I can do this:

import string

for character in string.printable[1:]:
    print(character)

Presumably there's a similar approach for UTF-8?

ninMonkey · Accepted Answer

Presumably there's a similar approach for UTF-8?

Do you want to know which code points outside the ASCII range are printable? Or do you want the UTF-8 encodings of printable characters?

To get all printable code points across all of Unicode:

unicode_max = 0x10ffff
printable_glyphs = [ chr(x) for x in range(0, unicode_max+1) if chr(x).isprintable() ]
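
The list above holds every printable code point in memory at once. A lazy variant might look like the sketch below, assuming a generator is acceptable for your use. The name iter_printable is a hypothetical helper, not something from the standard library:

```python
from itertools import islice

UNICODE_MAX = 0x10FFFF  # highest valid Unicode code point


def iter_printable(start=0, stop=UNICODE_MAX + 1):
    """Yield each printable character whose code point is in [start, stop)."""
    for codepoint in range(start, stop):
        character = chr(codepoint)
        if character.isprintable():
            yield character


# Take only the first three printable characters past the ASCII range,
# without scanning the rest of Unicode.
first_non_ascii = list(islice(iter_printable(start=0x80), 3))
```

Note that str.isprintable() treats the ASCII space as printable but rejects other separators and control characters, so the results depend on the Unicode data shipped with your Python version.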

UTF-8 is an encoding: a way of mapping text to specific bytes so that programs can store and exchange it.

Text in memory is not UTF-8. A Python str is a sequence of Unicode code points, and iterating over it yields one code point at a time. Bytes only appear once you encode the string.
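
To make the distinction concrete, here is a minimal sketch. The sample string is an arbitrary choice for illustration; it mixes characters that need one, two, and four bytes in UTF-8:

```python
# "a", "e with acute accent" (U+00E9), and the monkey emoji (U+1F412).
text = "a\u00e9\U0001F412"

# Iterating a str yields one code point per step.
code_points = [hex(ord(character)) for character in text]

# Encoding produces a bytes object; iterating it yields integers 0-255.
encoded = text.encode("utf-8")
byte_values = list(encoded)

# Encoding each character separately shows how many bytes each one takes.
per_character = [character.encode("utf-8") for character in text]
```

Here len(text) is 3 while len(encoded) is 7, which is why "iterating over UTF-8" can mean either of two different loops.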

Converting to UTF-8

import unicodedata
monkey = unicodedata.lookup('monkey')

print(f"""
    glyph: {monkey}
    codepoint: Dec: {ord(monkey)}
    codepoint: Hex:  {hex(ord(monkey))}

    utf8: { monkey.encode('utf8', errors='strict') }
    utf16: { monkey.encode('utf16', errors='strict') }
    utf32: { monkey.encode('utf32', errors='strict') }
""")

outputs:

glyph: 🐒
codepoint: Dec: 128018
codepoint: Hex:  0x1f412

utf8: b'\xf0\x9f\x90\x92'
utf16: b'\xff\xfe=\xd8\x12\xdc'
utf32: b'\xff\xfe\x00\x00\x12\xf4\x01\x00'
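
The leading b'\xff\xfe' in the utf16 and utf32 results is a byte order mark (BOM), not part of the character itself. As a sketch, assuming you want the raw code units without it, the endianness-specific codec names skip the BOM:

```python
monkey = "\U0001F412"

# The generic codec prepends a BOM in the platform's native byte order.
with_bom = monkey.encode("utf-16")

# Naming the byte order explicitly omits the BOM.
without_bom = monkey.encode("utf-16-le")

# Decoding with the matching codec round-trips to the original string.
round_trip = without_bom.decode("utf-16-le")
```

The four remaining bytes are the two surrogate code units that UTF-16 uses for code points above 0xFFFF.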
