Reputation: 101
I have a context-free grammar (CFG) that involves punctuation, e.g. nltk.parse_cfg("""PP-CLR -> IN `` NP-TTL""")
The `` is a valid Penn Treebank POS tag, but nltk does not recognize it. In fact, nltk.parse_cfg does not seem to recognize any characters other than alphanumerics and dashes, while the Penn Treebank tag set includes several punctuation tags, such as $ # : . (
So, should I keep the punctuation in my dataset? Or is there any way to parse these characters?
Thanks
Upvotes: 3
Views: 907
Reputation: 933
For people using the current generation of NLTK: you can add nonterminals that include special characters by manually updating the grammar object's set of productions. Below, I add the tag/nonterminal PRP$, which contains the special character $:
from nltk.grammar import Production
from nltk.grammar import Nonterminal
productions = my_grammar.productions()
productions.extend([Production(Nonterminal('Nom'),[Nonterminal('PRP$')])])
This is equivalent to adding the following rule to the CFG:
Nom -> PRP$
Using nltk.CFG.fromstring("Nom -> PRP$") instead raises an error.
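One caveat, as far as I can tell: productions() returns the grammar's own list, and extending it in place may not refresh the lookup indexes that the parsers rely on. A minimal sketch of a safer variant is to copy the rules, add the new ones, and build a fresh CFG from them. The toy grammar and the 'my' / 'dog' tokens here are hypothetical stand-ins for my_grammar and your data.

```python
from nltk import CFG, ChartParser
from nltk.grammar import Nonterminal, Production

# Hypothetical toy grammar standing in for my_grammar.
my_grammar = CFG.fromstring("""
NP -> Nom N
N -> 'dog'
""")

# Copy the existing rules, then add rules whose symbols contain '$'.
productions = list(my_grammar.productions())
productions.append(Production(Nonterminal('Nom'), [Nonterminal('PRP$')]))
productions.append(Production(Nonterminal('PRP$'), ['my']))

# Building a new CFG computes its indexes from the complete rule list.
new_grammar = CFG(my_grammar.start(), productions)

# Parse a token list that needs the PRP$ rules.
trees = list(ChartParser(new_grammar).parse(['my', 'dog']))
for tree in trees:
    print(tree)
```

Because the Production objects are built directly, the symbol names never pass through the grammar string reader, which is where the $ gets rejected.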
Upvotes: 0
Reputation: 122280
You might need to specify them explicitly as terminal nodes, e.g.:
>>> import nltk
>>> grammar = nltk.parse_cfg("""
... S -> NP VP
... VP -> V PUNCT
... PUNCT -> '.'
... V -> 'eat'
... NP -> 'I'
... """)
>>>
>>> sentence = "I eat .".split()
>>> cp = nltk.ChartParser(grammar)
>>> for tree in cp.nbest_parse(sentence):
... print tree
...
(S (NP I) (VP (V eat) (PUNCT .)))
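The session above uses nltk.parse_cfg and nbest_parse, which come from older NLTK releases. In NLTK 3 the equivalents are nltk.CFG.fromstring and the parser's parse method. A rough port of the same example, assuming NLTK 3 and Python 3, might look like this:

```python
import nltk

# Quoting the punctuation makes the grammar reader treat it as a terminal token.
grammar = nltk.CFG.fromstring("""
S -> NP VP
VP -> V PUNCT
PUNCT -> '.'
V -> 'eat'
NP -> 'I'
""")

sentence = "I eat .".split()
parser = nltk.ChartParser(grammar)

# parse() yields Tree objects; collect them so they can be inspected afterwards.
trees = list(parser.parse(sentence))
for tree in trees:
    print(tree)
```

The grammar is unchanged. Only the entry points and the print call differ from the original session.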
Upvotes: 3