Reputation: 147
I have Unicode text as follows
(S (NP (N \u0db6\u0dbd\u0dbd\u0dcf)) (VP (V \u0db6\u0dbb\u0dc0\u0dcf)))
How do I change this to a readable format by converting the '\u0___' codes into the corresponding readable characters? I'm using Python 2.7.
I obtained that output with the following code segment in NLTK (3.0), where tree is a nltk.tree.Tree:
for tree in treelist1:
    print unicode(str(tree))
I need something like print(TreePrettyPrinter(tree).text()), which gives the Unicode-compatible output I want, but with a tree layout that I don't want. Is there a method in NLTK that produces readable, text-like output as well?
I have the same issue with the output from
for rule in grammar1.productions():
    print(rule.unicode_repr())
where grammar1 is a nltk.grammar.CFG.
The output is as follows:
VP -> V
VP -> NP V
N -> '\u0db6\u0dbd\u0dca\u0dbd\u0dcf'
N -> '\u0db8\u0dd2\u0db1\u0dd2\u0dc3\u0dcf'
N -> '\u0db8\u0dda\u0dc3\u0dba'
The final results are perfectly fine. I only have issues with how the output is represented.
Upvotes: 4
Views: 872
Reputation: 147
The solutions in this question also work for Python 2.7.
This has nothing to do with NLTK. A simple solution is to decode the output text with 'unicode_escape':
print(str(tree).decode('unicode_escape'))
and
print(rule.unicode_repr().decode('unicode_escape'))
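To see what the decode step does without needing NLTK, here is a minimal sketch that uses the bracketed string from the question as a plain byte string. The helper name unescape is made up for illustration and is not part of NLTK. Note that 'unicode_escape' treats any non-escape bytes as Latin-1, so this is only safe when the text is ASCII plus \uXXXX escapes, as it is here.

```python
# -*- coding: utf-8 -*-

def unescape(escaped_bytes):
    """Turn literal \\uXXXX sequences in a byte string into real characters.

    Hypothetical helper for illustration; assumes the input is ASCII
    plus backslash-u escapes, because other bytes are read as Latin-1.
    """
    return escaped_bytes.decode('unicode_escape')

# Sample from the question; the doubled backslashes keep the escapes literal,
# which mimics what str(tree) returned in the question.
escaped = b"(S (NP (N \\u0db6\\u0dbd\\u0dbd\\u0dcf)) (VP (V \\u0db6\\u0dbb\\u0dc0\\u0dcf)))"
readable = unescape(escaped)

# After decoding, the escapes are the actual Sinhala code points.
expected = u"(S (NP (N \u0db6\u0dbd\u0dbd\u0dcf)) (VP (V \u0db6\u0dbb\u0dc0\u0dcf)))"
print(readable == expected)
```

Printing readable itself shows the Sinhala text, provided the terminal encoding can represent it (for example UTF-8).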
For an NLTK-style solution that prints a tree of type nltk.tree.Tree as bracketed text, use the following:
print(tree.pformat())
Upvotes: 3