Reputation: 764
How do I iterate over UTF-8?
import string
for character in string.printable[1:]:
    print(character)
Presumably there's a similar approach for UTF-8?
Upvotes: 2
Views: 1178
Reputation: 7501
Presumably there's a similar approach for UTF-8?
Do you want to know which code points outside the ASCII range are printable? Or do you want the UTF-8 encodings of printable characters?
unicode_max = 0x10ffff  # highest Unicode code point (same value as sys.maxunicode)
printable_glyphs = [chr(x) for x in range(unicode_max + 1) if chr(x).isprintable()]
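If you only want to loop rather than hold every printable character in memory, the same filter can be written lazily. This is a minimal sketch, not the only way to do it; `iter_printable` and its `start`/`stop` parameters are names made up for illustration.

```python
import sys
from itertools import islice


def iter_printable(start=0, stop=sys.maxunicode + 1):
    """Yield each printable character whose code point is in [start, stop)."""
    for codepoint in range(start, stop):
        character = chr(codepoint)
        if character.isprintable():
            yield character


# Take the first five printable characters beyond the ASCII range.
first_non_ascii = list(islice(iter_printable(start=0x80), 5))
print(first_non_ascii)
```

Because it is a generator, you can stop early with `islice` or `break` without scanning the full range of code points.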
As mentioned above, UTF-8 is an encoding: a rule for mapping text to specific bytes so that programs can exchange it.
In memory, text is not UTF-8. A Python str is a sequence of code points, and each character in it has exactly one code point.
import unicodedata
monkey = unicodedata.lookup('monkey')
print(f"""
glyph: {monkey}
codepoint: Dec: {ord(monkey)}
codepoint: Hex: {hex(ord(monkey))}
utf8: { monkey.encode('utf8', errors='strict') }
utf16: { monkey.encode('utf16', errors='strict') }
utf32: { monkey.encode('utf32', errors='strict') }
""")
outputs:
glyph: 🐒
codepoint: Dec: 128018
codepoint: Hex: 0x1f412
utf8: b'\xf0\x9f\x90\x92'
utf16: b'\xff\xfe=\xd8\x12\xdc'
utf32: b'\xff\xfe\x00\x00\x12\xf4\x01\x00'
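To tie this back to iteration: looping over a `str` gives you code points, while looping over the encoded `bytes` gives you individual byte values. A small sketch, using a made-up three-character sample string that ends with the same monkey emoji:

```python
text = "a\u00e9\U0001F412"  # 'a', 'é', and the monkey emoji

# Iterating a str yields one code point (a one-character str) at a time.
codepoints = [hex(ord(character)) for character in text]

# Iterating the encoded bytes yields one integer (0-255) per byte.
utf8_bytes = list(text.encode("utf-8"))

# Per-character view: how many UTF-8 bytes each code point needs.
byte_lengths = {character: len(character.encode("utf-8")) for character in text}

print(codepoints)    # ['0x61', '0xe9', '0x1f412']
print(utf8_bytes)    # [97, 195, 169, 240, 159, 144, 146]
print(byte_lengths)  # {'a': 1, 'é': 2, '🐒': 4}
```

So "iterating over UTF-8" usually means one of two things: iterate the string if you care about characters, or iterate `text.encode("utf-8")` if you care about the bytes.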
Upvotes: 1