Reputation: 2928
This is related to the following questions -
I have a Python app that performs the following tasks -
# -*- coding: utf-8 -*-
1. Read a Unicode text file (non-English) -
import codecs

def readfile(file, access, encoding):
    with codecs.open(file, access, encoding) as f:
        return f.read()

text = readfile('teststory.txt','r','utf-8-sig')
This returns the contents of the given text file as a unicode string.
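The effect of the 'utf-8-sig' codec in step 1 can be seen in isolation. This is a minimal sketch, not the app's actual code: `read_text` and the temporary file are hypothetical stand-ins for `readfile` and teststory.txt, and the sample word is arbitrary.

```python
# -*- coding: utf-8 -*-
import codecs
import os
import tempfile


def read_text(path, encoding="utf-8-sig"):
    """Hypothetical helper: read a whole file and return decoded text."""
    with codecs.open(path, "r", encoding) as f:
        return f.read()


# Write a small UTF-8 file that starts with a byte order mark (BOM),
# as some editors do when saving "UTF-8" text.
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as f:
    f.write(codecs.BOM_UTF8 + u"caf\u00e9".encode("utf-8"))

with_sig = read_text(path, "utf-8-sig")  # BOM is stripped
plain = read_text(path, "utf-8")         # BOM survives as u"\ufeff"
os.remove(path)

print(repr(with_sig))
print(repr(plain))
```

With 'utf-8-sig' the text starts directly with the first real character; with plain 'utf-8' a leading U+FEFF remains and can end up inside the first word.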
2. Split text into sentences.
3. Go through words in each sentence and identify verbs, nouns etc.
Refer - Searching for Unicode characters in Python and Find word infront and behind of a Python list
4. Add them to separate string variables, as below -
nouns = '"CAR" | "BUS"'
verbs = '"DRIVES" | "HITS"'
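One way to build such strings from lists of words is sketched below. `to_alternatives` is a hypothetical helper, not part of NLTK, and the word lists are the placeholder values from step 4; the sketch assumes each word should become a double-quoted terminal and that no word contains a double quote itself.

```python
# -*- coding: utf-8 -*-

def to_alternatives(words):
    """Quote each word as a grammar terminal and join them with ' | '.

    Hypothetical helper: keeps everything as unicode so that no byte
    strings get mixed into the grammar text later on.
    """
    return u" | ".join(u'"%s"' % word for word in words)


nouns = to_alternatives([u"CAR", u"BUS"])
verbs = to_alternatives([u"DRIVES", u"HITS"])

print(nouns)
print(verbs)
```

Joining with a separator also avoids a dangling `|` at the end of the alternatives.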
5. Now I'm trying to pass them into an NLTK context-free grammar, as below -
grammar = nltk.parse_cfg('''
S -> NP VP
NP -> N
VP -> V | NP V
N -> '''+nouns+'''
V -> '''+verbs+'''
''')
It gives me the following error -
line 40, in V -> '''+verbs+''' UnicodeDecodeError: 'ascii' codec can't decode byte 0xe0 in position 114: ordinal not in range(128)
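The same message can be reproduced in isolation. This is only a sketch under an assumption the traceback does not confirm: that a UTF-8 encoded byte string is being decoded as ASCII somewhere during the concatenation. The sample character is arbitrary, chosen because its UTF-8 encoding starts with byte 0xe0.

```python
# -*- coding: utf-8 -*-

# An arbitrary non-ASCII character whose UTF-8 form is three bytes,
# the first of which is 0xe0.
utf8_bytes = u"\u0d9a".encode("utf-8")

try:
    # Decoding those bytes as ASCII fails on the very first byte.
    utf8_bytes.decode("ascii")
    message = None
except UnicodeDecodeError as exc:
    message = str(exc)

print(message)
```

In Python 2 this decode happens implicitly whenever a byte string and a unicode string are concatenated, which is why the error can show up on a line that contains no explicit decode call.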
How can I overcome this and pass the variables into an NLTK CFG?
Complete Code - https://dl.dropboxusercontent.com/u/4959382/new.zip
Upvotes: 0
Views: 3035
Reputation: 11781
Overall you have these strategies:
The NLTK installed with pip (2.0.4 in my case) doesn't accept unicode directly, but it does accept quoted unicode constants; all of the following appear to work:
In [26]: nltk.parse_cfg(u'S -> "\N{EURO SIGN}" | bar')
Out[26]: <Grammar with 2 productions>
In [27]: nltk.parse_cfg(u'S -> "\N{EURO SIGN}" | bar'.encode("utf-8"))
Out[27]: <Grammar with 2 productions>
In [28]: nltk.parse_cfg(u'S -> "\N{EURO SIGN}" | bar'.encode("unicode_escape"))
Out[28]: <Grammar with 2 productions>
Note that I quoted the unicode text ("€") but not the normal text (bar).
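To see what the two encoded variants actually hand to the parser, the strings can be inspected without NLTK. A minimal sketch using only the standard library; `grammar_text` is just the example grammar from the session above.

```python
# -*- coding: utf-8 -*-

grammar_text = u'S -> "\N{EURO SIGN}" | bar'

# UTF-8: the euro sign becomes three raw bytes inside the quotes.
as_utf8 = grammar_text.encode("utf-8")

# unicode_escape: the euro sign becomes the pure-ASCII text \u20ac.
as_escaped = grammar_text.encode("unicode_escape")

print(repr(as_utf8))
print(repr(as_escaped))

# Both forms decode back to the original unicode text.
assert as_utf8.decode("utf-8") == grammar_text
assert as_escaped.decode("unicode_escape") == grammar_text
```

The unicode_escape form contains only ASCII bytes, so it cannot trigger an implicit ASCII decode error when it is concatenated with other strings.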
Upvotes: 1