Reputation: 11
I want to create a very simple context-free grammar for the Greek language using nltk. I run Python 2.7 on Windows.
Here's my code:
# -*- coding: utf-8 -*-
import nltk
grammar = nltk.CFG.fromstring("""
S -> Verb Noun
Verb -> a
Noun -> b
""")
a="κάνω"
b="ποδήλατο"
user_input = "κάνω ποδήλατο"
How can I tell whether user_input is grammatically correct? I tried:
sent = user_input.split()
parser = nltk.ChartParser(grammar)
for tree in parser.parse(sent):
    print tree
but I get the following error, which is raised in the grammar.py file (line 632) that ships with nltk:
ValueError: Grammar does not cover some of the input words: u"'\\xce\\xba\\xce\\xac\\xce\\xbd\\xcf\\x89', '\\xcf\\x80\\xce\\xbf\\xce\\xb4\\xce\\xae\\xce\\xbb\\xce\\xb1\\xcf\\x84\\xce\\xbf'".
I only get the error when I run the for loop; up to that point there is no error. So I suppose it is some kind of encoding problem, which I don't know how to overcome.
Upvotes: 1
Views: 699
Reputation: 122280
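First, you have to declare the terminals, i.e. the words in the lexicon, directly in the CFG grammar when you use nltk.CFG.fromstring. The grammar string does not see the Python variables a and b; unquoted symbols such as a are read as nonterminals, so the Greek words never enter the grammar. Write them as quoted strings instead: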
# -*- coding: utf-8 -*-
import nltk
grammar = nltk.CFG.fromstring(u"""
S -> Verb Noun
Verb -> "κάνω"
Noun -> "ποδήλατο"
""")
parser = nltk.ChartParser(grammar)
print parser.grammar()
[out]:
Grammar with 3 productions (start state = S)
S -> Verb Noun
Verb -> '\u03ba\u03ac\u03bd\u03c9'
Noun -> '\u03c0\u03bf\u03b4\u03ae\u03bb\u03b1\u03c4\u03bf'
Now let's look at your user_input:
>>> print ["κάνω ποδήλατο"]
['\xce\xba\xce\xac\xce\xbd\xcf\x89 \xcf\x80\xce\xbf\xce\xb4\xce\xae\xce\xbb\xce\xb1\xcf\x84\xce\xbf']
Notice that in Python 2.x a plain string literal is a byte string (here, the UTF-8 encoded bytes of the text), whereas in Python 3.x it would be a Unicode string by default. Now look at it after decoding it from UTF-8:
>>> print ["κάνω ποδήλατο".decode('utf8')]
[u'\u03ba\u03ac\u03bd\u03c9 \u03c0\u03bf\u03b4\u03ae\u03bb\u03b1\u03c4\u03bf']
Note that when you hardcode a string in a UTF-8 source file, writing u"κάνω ποδήλατο" has the same effect as explicitly decoding it with "κάνω ποδήλατο".decode('utf8').
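The difference between the two string types can be reproduced without nltk. The following is a minimal sketch, written to run on both Python 2.7 and Python 3, with purely illustrative variable names. It shows that the UTF-8 byte string and the Unicode string for the same word have different lengths, which is one reason a byte-string token does not match a Unicode terminal in the grammar.

```python
# -*- coding: utf-8 -*-
# Illustrative names; runs on Python 2.7 and Python 3.
word_unicode = u"\u03ba\u03ac\u03bd\u03c9"   # the word "κάνω" as 4 code points
word_bytes = word_unicode.encode("utf-8")    # 8 bytes: each Greek letter takes 2

# Decoding the bytes gives back the original Unicode string.
roundtrip = word_bytes.decode("utf-8")

print(len(word_unicode))          # 4
print(len(word_bytes))            # 8
print(roundtrip == word_unicode)  # True
```

So if the input arrives as bytes (from a file, a socket, or a Python 2 literal), decoding it before calling split() keeps the tokens in the same type as the grammar's terminals.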
Now the input has the same form as the terminals that nltk.CFG.fromstring() reads into the grammar, so the full script becomes:
# -*- coding: utf-8 -*-
import nltk
grammar = nltk.CFG.fromstring(u"""
S -> Verb Noun
Verb -> "κάνω"
Noun -> "ποδήλατο"
""")
parser = nltk.ChartParser(grammar)
# Unicode literal, so the tokens match the grammar's Unicode terminals.
user_input = u"κάνω ποδήλατο"
sent = user_input.split()
for tree in parser.parse(sent):
    print tree
[out]:
(S (Verb \u03ba\u03b1\u03bd\u03c9) (Noun \u03c0\u03bf\u03b4\u03b7\u03bb\u03b1\u03c4\u03bf))
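To answer the yes/no question directly, the parsing loop can be wrapped in a small helper. This is only a sketch, assuming nltk 3 is installed; is_grammatical is a hypothetical name, not part of nltk. It treats the ValueError that nltk raises for uncovered words as "not grammatical" and returns True as soon as one parse tree is found.

```python
# -*- coding: utf-8 -*-
import nltk

grammar = nltk.CFG.fromstring(u"""
S -> Verb Noun
Verb -> "κάνω"
Noun -> "ποδήλατο"
""")
parser = nltk.ChartParser(grammar)


def is_grammatical(sentence):
    """Hypothetical helper: True if the grammar yields at least one parse."""
    tokens = sentence.split()
    try:
        for _tree in parser.parse(tokens):
            return True          # one parse tree is enough
    except ValueError:
        return False             # some word is not covered by the grammar
    return False                 # all words known, but no valid structure


print(is_grammatical(u"κάνω ποδήλατο"))
print(is_grammatical(u"ποδήλατο κάνω"))
```

The sentence passed in should be a Unicode string, for the reasons given above.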
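You may notice something odd about the printed tree: the words are shown not as readable Unicode but as their escaped code-point representation. A byte string containing those escapes prints literally, while a Unicode string prints the Greek letters: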
>>> x = '\u03ba\u03b1\u03bd\u03c9'
>>> print x
\u03ba\u03b1\u03bd\u03c9
>>> print x.decode('utf8')
\u03ba\u03b1\u03bd\u03c9
>>> print x.encode('utf8')
\u03ba\u03b1\u03bd\u03c9
>>> x = u'\u03ba\u03b1\u03bd\u03c9'
>>> print x
κανω
To recover readable Unicode from the escaped representation, you would need to do this (thanks to @Kasra; see "How to retrieve my unicode from the unicode byte representation"):
>>> s='\u03ba\u03b1\u03bd\u03c9'
>>> print unicode(s,'unicode_escape')
κανω
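The same conversion can be written so that it also runs on Python 3, where the unicode() builtin no longer exists. This is a minimal sketch using the standard unicode_escape codec, with illustrative variable names:

```python
# -*- coding: utf-8 -*-
# Illustrative sketch; runs on Python 2.7 and Python 3.
# 24 ASCII bytes: a backslash, 'u' and four hex digits, four times over.
escaped = b"\\u03ba\\u03b1\\u03bd\\u03c9"

# The unicode_escape codec interprets each \uXXXX sequence as one code point.
decoded = escaped.decode("unicode_escape")

print(len(escaped))   # 24
print(len(decoded))   # 4
```

Here decoded holds the four Greek letters as a Unicode string, which can then be printed or compared with other Unicode text.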
Upvotes: 2