Reputation: 1606
I am using polyglot to tokenize Burmese text. Here is what I am doing:
from polyglot.text import Text
blob = u"""
ထိုင္းေရာက္ျမန္မာလုပ္သားမ်ားကို လုံၿခဳံေရး အေၾကာင္းျပၿပီး ထိုင္းရဲဆက္လက္ဖမ္းဆီး၊ ဧည့္စာရင္းအေၾကာင္းျပ၍ ဒဏ္ေငြ႐ိုက္
"""
text = Text(blob)
When I run:
print(text.words)
It outputs in the following format:
[u'\u1011\u102d\u102f', u'\u1004\u1039\u1038\u1031', u'\u101b\u102c', u'\u1000\u1039\u103b', u'\u1019', u'\u1014\u1039', u'\u1019\u102c', u'\u101c\u102f', u'\u1015\u1039', u'\u101e\u102c\u1038', u'\u1019\u103a\u102c\u1038', u'\u1000\u102d\u102f', u'\u101c\u102f\u1036', u'\u107f', u'\u1001\u1033\u1036\u1031', u'\u101b\u1038', u'\u1021\u1031\u107e', u'\u1000\u102c', u'\u1004\u1039\u1038\u103b', u'\u1015\u107f', u'\u1015\u102e\u1038', u'\u1011\u102d\u102f', u'\u1004\u1039\u1038', u'\u101b\u1032', u'\u1006', u'\u1000\u1039', u'\u101c', u'\u1000\u1039', u'\u1016', u'\u1019\u1039\u1038', u'\u1006\u102e\u1038', u'\u104a', u'\u1027', u'\u100a\u1037\u1039', u'\u1005\u102c', u'\u101b', u'\u1004\u1039\u1038', u'\u1021\u1031\u107e', u'\u1000\u102c', u'\u1004\u1039\u1038\u103b', u'\u1015', u'\u104d', u'\u1012', u'\u100f\u1039\u1031', u'\u1004\u103c\u1090\u102d\u102f', u'\u1000\u1039']
What kind of output is this, and why does it look like this? How can I convert it back to a form that I can actually read?
I also tried the following:
text.words[1].decode('unicode-escape')
but it raises an error: UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-3: ordinal not in range(128)
Upvotes: 0
Views: 658
Reputation: 177971
That is how Python 2 prints a list. It is debugging output (see repr()) that unambiguously shows the contents of the list. u'' indicates a Unicode string and \uxxxx indicates the Unicode code point U+xxxx. The output is all ASCII, so it works on any terminal. If you print the strings in the list directly, they will display correctly, provided your terminal supports the characters being printed. Example:
words = [u'\u1011\u102d\u102f', u'\u1004\u1039\u1038\u1031', u'\u101b\u102c']
print words
for word in words:
    print word
Output:
[u'\u1011\u102d\u102f', u'\u1004\u1039\u1038\u1031', u'\u101b\u102c']
ထို
င္းေ
ရာ
To reemphasize, your terminal must be configured with an encoding that supports the Unicode code points (ideally, UTF-8), and use a font that supports the characters as well. Otherwise, you can print the text to a file in UTF-8 encoding, and view the file in an editor that supports UTF-8 and has fonts that support the characters:
import io
with io.open('example.txt', 'w', encoding='utf8') as f:
    for word in words:
        f.write(word + u'\n')
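To check that the file really holds the text and not escape sequences, it can be read back with the same encoding. This is only a sketch: it reuses the sample `words` list from above, and the temporary directory is an assumption made here so that no real file gets overwritten.

```python
import io
import os
import tempfile

words = [u'\u1011\u102d\u102f', u'\u1004\u1039\u1038\u1031', u'\u101b\u102c']

# Assumed location: a fresh temporary directory, chosen only for this example.
path = os.path.join(tempfile.mkdtemp(), 'example.txt')

# Write one token per line, encoded as UTF-8.
with io.open(path, 'w', encoding='utf8') as f:
    for word in words:
        f.write(word + u'\n')

# Read the file back with the same encoding and split it into tokens again.
with io.open(path, 'r', encoding='utf8') as f:
    round_trip = f.read().splitlines()

# The tokens read from disk are the same strings that were written.
print(round_trip == words)
```

If the comparison is true, the data is intact, and any garbled display comes from the viewer's encoding or font rather than from the strings themselves.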
Switch to Python 3 and things get simpler. It displays the characters by default if the terminal supports them, but you can still get the debugging output as well:
words = [u'\u1011\u102d\u102f', u'\u1004\u1039\u1038\u1031', u'\u101b\u102c']
print(words)
print(ascii(words))
Output:
['ထို', 'င္းေ', 'ရာ']
['\u1011\u102d\u102f', '\u1004\u1039\u1038\u1031', '\u101b\u102c']
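As a rough illustration of the points above, here is a Python 3 sketch (the variable names are made up for this example). It shows that the escaped form is only a display of the same string, and it reproduces the error from the question: in Python 2, calling .decode() on a unicode string first encodes it with the ASCII codec, and that hidden step is what fails.

```python
words = [u'\u1011\u102d\u102f', u'\u1004\u1039\u1038\u1031', u'\u101b\u102c']

# ascii() changes only how the string is displayed, not what it contains.
escaped = ascii(words[0])
code_points = len(words[0])  # counts code points, not escape characters

# Joining the tokens gives a single string that can be printed directly,
# instead of the repr of a list.
joined = u' '.join(words)

# Reproduce the implicit ASCII encode that Python 2 performs before .decode().
error_name = None
error_span = None
try:
    words[1].encode('ascii')
except UnicodeEncodeError as exc:
    error_name = type(exc).__name__
    error_span = (exc.start, exc.end)  # end is exclusive

print(escaped)
print(code_points)
print(error_name, error_span)
```

So there is nothing to decode: the tokens are already Unicode strings, and print(joined) shows them as text on a terminal that supports the characters.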
Upvotes: 2
Reputation: 23276
It looks like your terminal is unable to display UTF-8 encoded Unicode. Try saving the output to a file instead, encoding each token as UTF-8, as follows:
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
from polyglot.text import Text
blob = u"""
ထိုင္းေရာက္ျမန္မာလုပ္သားမ်ားကို လုံၿခဳံေရး အေၾကာင္းျပၿပီး ထိုင္းရဲဆက္လက္ဖမ္းဆီး၊ ဧည့္စာရင္းအေၾကာင္းျပ၍ ဒဏ္ေငြ႐ိုက္
"""
text = Text(blob)
with open('output.txt', 'a') as the_file:
    for word in text.words:
        the_file.write("\n")
        the_file.write(word.encode("utf-8"))
Upvotes: 0