Mohd Shahid

Reputation: 1606

Converting unicode list to a readable format

I am using polyglot to tokenize text in the Burmese language. Here is what I am doing:

    from polyglot.text import Text

    blob = u"""
ထိုင္းေရာက္ျမန္မာလုပ္သားမ်ားကို လုံၿခဳံေရး အေၾကာင္းျပၿပီး ထိုင္းရဲဆက္လက္ဖမ္းဆီး၊ ဧည့္စာရင္းအေၾကာင္းျပ၍ ဒဏ္ေငြ႐ိုက္
"""
    text = Text(blob)

When I run:

print(text.words)

It outputs in the following format:

[u'\u1011\u102d\u102f', u'\u1004\u1039\u1038\u1031', u'\u101b\u102c', u'\u1000\u1039\u103b', u'\u1019', u'\u1014\u1039', u'\u1019\u102c', u'\u101c\u102f', u'\u1015\u1039', u'\u101e\u102c\u1038', u'\u1019\u103a\u102c\u1038', u'\u1000\u102d\u102f', u'\u101c\u102f\u1036', u'\u107f', u'\u1001\u1033\u1036\u1031', u'\u101b\u1038', u'\u1021\u1031\u107e', u'\u1000\u102c', u'\u1004\u1039\u1038\u103b', u'\u1015\u107f', u'\u1015\u102e\u1038', u'\u1011\u102d\u102f', u'\u1004\u1039\u1038', u'\u101b\u1032', u'\u1006', u'\u1000\u1039', u'\u101c', u'\u1000\u1039', u'\u1016', u'\u1019\u1039\u1038', u'\u1006\u102e\u1038', u'\u104a', u'\u1027', u'\u100a\u1037\u1039', u'\u1005\u102c', u'\u101b', u'\u1004\u1039\u1038', u'\u1021\u1031\u107e', u'\u1000\u102c', u'\u1004\u1039\u1038\u103b', u'\u1015', u'\u104d', u'\u1012', u'\u100f\u1039\u1031', u'\u1004\u103c\u1090\u102d\u102f', u'\u1000\u1039']

What kind of output is this, and why does it look like that? How can I convert it back into a form I can actually read?

I had also tried the following:

text.words[1].decode('unicode-escape')

but it raises an error: UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-3: ordinal not in range(128)

Upvotes: 0

Views: 658

Answers (2)

Mark Tolonen

Reputation: 177971

That is how Python 2 prints a list. Printing a list shows each element using its repr(), a debugging form that unambiguously indicates the content. u'' marks a Unicode string, and \uxxxx denotes the Unicode code point U+xxxx. The output is pure ASCII, so it works on any terminal. If you print the strings in the list directly, they display correctly, provided your terminal supports the characters being printed. Example:

words = [u'\u1011\u102d\u102f', u'\u1004\u1039\u1038\u1031', u'\u101b\u102c']
print words
for word in words:
    print word

Output:

[u'\u1011\u102d\u102f', u'\u1004\u1039\u1038\u1031', u'\u101b\u102c']
ထို
င္းေ
ရာ
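If you want the whole list on one readable line rather than one word per line, joining the elements into a single string should work, because printing a string shows its characters rather than the repr() of each element. A minimal sketch, assuming the same three sample words and a terminal that can display them (the separator and the name `line` are just illustrative choices):

```python
words = [u'\u1011\u102d\u102f', u'\u1004\u1039\u1038\u1031', u'\u101b\u102c']

# Join the tokens into one Unicode string, separated by spaces.
line = u' '.join(words)

# Printing a string (not a list) displays the characters themselves,
# as long as the terminal encoding and font support them.
print(line)
```

The strings themselves are unchanged by this; only the way they are displayed differs.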

To reemphasize, your terminal must be configured with an encoding that supports the Unicode code points (ideally, UTF-8), and use a font that supports the characters as well. Otherwise, you can print the text to a file in UTF-8 encoding, and view the file in an editor that supports UTF-8 and has fonts that support the characters:

import io
with io.open('example.txt','w',encoding='utf8') as f:
    for word in words:
        f.write(word + u'\n')
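To check that nothing was lost on the way to disk, you can read the file back with the same encoding and compare it with the original list. This is only a sketch: the temporary directory and the names `path` and `read_back` are assumptions made for the example, not part of the question.

```python
import io
import os
import tempfile

words = [u'\u1011\u102d\u102f', u'\u1004\u1039\u1038\u1031', u'\u101b\u102c']

# Hypothetical scratch location; any writable path would do.
path = os.path.join(tempfile.mkdtemp(), 'example.txt')

# Write one word per line, encoded as UTF-8.
with io.open(path, 'w', encoding='utf8') as f:
    for word in words:
        f.write(word + u'\n')

# Read the file back with the same encoding and split it into lines.
with io.open(path, encoding='utf8') as f:
    read_back = f.read().splitlines()

# True when the round trip preserved every code point.
print(read_back == words)
```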

Switch to Python 3 and things get simpler. It displays the characters by default if the terminal supports them, and you can still get the debugging output when you want it:

words = [u'\u1011\u102d\u102f', u'\u1004\u1039\u1038\u1031', u'\u101b\u102c']
print(words)
print(ascii(words))

Output:

['ထို', 'င္းေ', 'ရာ']
['\u1011\u102d\u102f', '\u1004\u1039\u1038\u1031', '\u101b\u102c']
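As for the UnicodeEncodeError from the decode attempt in the question: the words are already Unicode strings, so there is nothing to decode. In Python 2, calling .decode() on a Unicode string first tries to encode it implicitly with the ASCII codec, and that hidden step is what fails. The sketch below reproduces the same failure explicitly with .encode('ascii'), so it behaves the same way on Python 3, where str has no .decode() at all (the names `word` and `message` are illustrative):

```python
word = u'\u1004\u1039\u1038\u1031'

# Encoding non-ASCII code points with the ASCII codec fails; this is the
# step Python 2 performs implicitly when .decode() is called on unicode.
try:
    word.encode('ascii')
    message = None
except UnicodeEncodeError as exc:
    message = str(exc)

print(message)

# Encoding with UTF-8 succeeds, because UTF-8 can represent every code point.
utf8_bytes = word.encode('utf-8')
print(len(utf8_bytes))
```

Each of the four code points in this word takes three bytes in UTF-8, which is why the encoded form is longer than the string.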

Upvotes: 2

Suhail Gupta

Reputation: 23276

It looks like your terminal is unable to display the UTF-8 encoded text. Try saving the output to a file instead, encoding each token as UTF-8:

    # -*- coding: utf-8 -*-

    from __future__ import unicode_literals
    from polyglot.text import Text

    blob = u"""
    ထိုင္းေရာက္ျမန္မာလုပ္သားမ်ားကို လုံၿခဳံေရး အေၾကာင္းျပၿပီး ထိုင္းရဲဆက္လက္ဖမ္းဆီး၊ ဧည့္စာရင္းအေၾကာင္းျပ၍ ဒဏ္ေငြ႐ိုက္
    """
    text = Text(blob)


    with open('output.txt', 'a') as the_file:
        for word in text.words:
            the_file.write("\n")
            the_file.write(word.encode("utf-8"))

Upvotes: 0
