How to properly tabulate unicode data

Question

(I am on python 2.7)

I have this test:

# -*- coding: utf-8 -*-

import binascii

test_cases = [
    'aaaaa',    # Normal bytestring
    'ááááá',    # Normal bytestring, but with extended ascii. Since the file is utf-8 encoded, this is utf-8 encoded
    'ℕℤℚℝℂ',    # Encoded unicode. The editor has encoded this, and it is defined as string, so it is left encoded by python
    u'aaaaa',   # unicode object. The string itself is utf-8 encoded, as defined in the "coding" directive at the top of the file
    u'ááááá',   # unicode object. The string itself is utf-8 encoded, as defined in the "coding" directive at the top of the file
    u'ℕℤℚℝℂ',   # unicode object. The string itself is utf-8 encoded, as defined in the "coding" directive at the top of the file
]
FORMAT = '%-20s -> %2d %-20s %-30s %-30s'
for data in test_cases :
    try:
        hexlified = binascii.hexlify(data)
    except:
        hexlified = None
    print FORMAT % (data, len(data), type(data), hexlified, repr(data))

Which produces the output:

aaaaa                ->  5 <type 'str'>         6161616161                     'aaaaa'
ááááá           -> 10 <type 'str'>         c3a1c3a1c3a1c3a1c3a1           '\xc3\xa1\xc3\xa1\xc3\xa1\xc3\xa1\xc3\xa1'
ℕℤℚℝℂ      -> 15 <type 'str'>         e28495e284a4e2849ae2849de28482 '\xe2\x84\x95\xe2\x84\xa4\xe2\x84\x9a\xe2\x84\x9d\xe2\x84\x82'
aaaaa                ->  5 <type 'unicode'>     6161616161                     u'aaaaa'
ááááá                ->  5 <type 'unicode'>     None                           u'\xe1\xe1\xe1\xe1\xe1'
ℕℤℚℝℂ                ->  5 <type 'unicode'>     None                           u'\u2115\u2124\u211a\u211d\u2102'

As you can see, the columns are not properly aligned for the strings with non-ascii characters. This is because the length of those strings, in bytes, is more than the number of unicode characters. How can I tell print to take into account the number of characters, and not the number of bytes when padding the fields?

Eric · Accepted Answer

When Python 2.7 sees 'ℕℤℚℝℂ', it reads "here are 15 arbitrary byte values". It has no knowledge of what characters they represent, nor of the encoding used to represent them. You need to decode this byte string into a unicode string, specifying the encoding, before you can expect Python to count characters:

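for data in test_cases:
    if isinstance(data, bytes):
        # Decode byte strings so that padding counts characters, not bytes
        data = data.decode('utf-8')
    hexlified = binascii.hexlify(data.encode('utf-8'))  # hex of the UTF-8 bytes
    print FORMAT % (data, len(data), type(data), hexlified, repr(data))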
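
To see the byte count versus character count difference in isolation, here is a minimal sketch. The helper name `to_text` is made up for illustration, and the input is assumed to be UTF-8. It is written so that it should run unchanged on both Python 2.7 and Python 3:

```python
# -*- coding: utf-8 -*-
from __future__ import print_function


def to_text(value, encoding='utf-8'):
    """Hypothetical helper: decode byte strings, pass text through unchanged."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


# The 15 UTF-8 bytes from the question's third test case
raw = b'\xe2\x84\x95\xe2\x84\xa4\xe2\x84\x9a\xe2\x84\x9d\xe2\x84\x82'
text = to_text(raw)

byte_count = len(raw)    # counts bytes
char_count = len(text)   # counts characters (code points)

# Padding a text value counts characters, so the field comes out 20 wide
padded = u'%-20s' % text

print(byte_count, char_count, len(padded))  # prints: 15 5 20
```

The point is to decode once, at the boundary, and to do all formatting on text rather than on bytes.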

Note that in Python 3, all string literals are unicode objects by default.
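
As a rough illustration of that Python 3 behaviour (a sketch, not part of the original test; the variable names are arbitrary), a plain literal is already text, so padding counts characters without any decoding step:

```python
# Python 3 only: a plain string literal is text (str), not bytes.
# Escapes are used here so the snippet does not depend on the file encoding.
label = '\u2115\u2124\u211a\u211d\u2102'

row = '%-20s|' % label           # padded by character count
encoded = label.encode('utf-8')  # bytes only appear when you encode explicitly

print(len(label), len(row), len(encoded))  # prints: 5 21 15
```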
