Upekha Vandebona

Reputation: 147

Converting Unicode-escaped text to readable text in Python

I have Unicode text that is displayed as escape sequences, as follows:

(S (NP (N \u0db6\u0dbd\u0dbd\u0dcf)) (VP (V \u0db6\u0dbb\u0dc0\u0dcf)))

How do I change this to a readable format by converting the '\u0___' codes into the characters they represent? I'm using Python 2.7.

I obtained that output with the following code segment in NLTK (3.0), where tree is an nltk.tree.Tree:

for tree in treelist1:
    print unicode(str(tree))

I need something like print(TreePrettyPrinter(tree).text()), which gives the Unicode-compatible output I want, but in a tree layout that I don't want. Is there a method in NLTK that produces equally readable output as plain text, without the tree layout?


I have the same issue with the output from:

for rule in grammar1.productions():
    print(rule.unicode_repr())

where grammar1 is an nltk.grammar.CFG.

The output is as follows:

VP -> V
VP -> NP V
N -> '\u0db6\u0dbd\u0dca\u0dbd\u0dcf'
N -> '\u0db8\u0dd2\u0db1\u0dd2\u0dc3\u0dcf'
N -> '\u0db8\u0dda\u0dc3\u0dba'

The final results are perfectly fine. My only issue is with how the output is displayed.

Upvotes: 4

Views: 872

Answers (1)

Upekha Vandebona

Reputation: 147

The solutions are in this question. They also work for Python 2.7.

This has nothing to do with NLTK. The simple solution is to decode the output text with the 'unicode_escape' codec:

print(str(tree).decode('unicode_escape'))

and

print(rule.unicode_repr().decode('unicode_escape'))
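The same idea can be tried without NLTK installed. The following is only a minimal sketch: the helper name unescape is made up for illustration, and it assumes the input contains nothing but ASCII characters (the \uXXXX escapes themselves are ASCII). It is written so that it should also run on Python 3, where str has no .decode method.

```python
# -*- coding: utf-8 -*-

def unescape(text):
    """Turn literal \\uXXXX sequences into the characters they name.

    Hypothetical helper, not part of NLTK. Assumes `text` is ASCII-only.
    Note that 'unicode_escape' also interprets other backslash escapes
    such as \\n, so it is only suitable for text like the output above.
    """
    # encode('ascii') yields bytes on Python 3 and a plain str on Python 2;
    # both can then be decoded with the 'unicode_escape' codec.
    return text.encode('ascii').decode('unicode_escape')


# The bracketed tree from the question, as a raw string so that the
# backslashes stay literal.
escaped = r"(S (NP (N \u0db6\u0dbd\u0dbd\u0dcf)) (VP (V \u0db6\u0dbb\u0dc0\u0dcf)))"
readable = unescape(escaped)

if __name__ == "__main__":
    # Needs a terminal that can display Sinhala characters.
    print(readable)
```

The brackets and labels pass through unchanged, and only the escape sequences are replaced.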

For an NLTK-native solution that prints a tree of type nltk.tree.Tree as bracketed text, use the following:

print(tree.pformat())

Upvotes: 3
