Kristine T

Reputation: 101

How to parse special characters in a context-free grammar?

I have a context-free grammar (CFG) that involves punctuation, e.g. nltk.parse_cfg("""PP-CLR -> IN `` NP-TTL""")

The `` is a valid Penn Treebank POS tag, but nltk does not recognize it. In fact, nltk.parse_cfg does not seem to recognize any character other than alphanumerics and dashes, while the Penn Treebank tag set has several punctuation tags, such as $ # : . (

So, should I keep the punctuation in my dataset? Or is there any way to parse these characters?

Thanks

Upvotes: 3

Views: 907

Answers (2)

Harsh Verma

Reputation: 933

For people using the current generation of NLTK, you can add nonterminals that include special characters by manually updating the grammar object's set of productions. Below, I add the tag/nonterminal PRP$, which contains the special character $:

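from nltk.grammar import CFG, Nonterminal, Production
# my_grammar is an existing CFG object
productions = list(my_grammar.productions())
productions.append(Production(Nonterminal('Nom'), [Nonterminal('PRP$')]))
# rebuild the grammar so its internal indexes include the new production
my_grammar = CFG(my_grammar.start(), productions)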

This is equivalent to adding the following rule to the CFG:

Nom -> PRP$

Using nltk.CFG.fromstring("Nom -> PRP$") instead throws an error.
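
As a rough, self-contained sketch of the same idea: the toy rules, the words 'my' and 'dog', and the names base and extra are all made up for illustration. It passes the combined production list to a fresh CFG instead of mutating the existing one, on the assumption that the grammar's lookup indexes are only computed when the object is constructed.

```python
from nltk import CFG, ChartParser
from nltk.grammar import Nonterminal, Production

# Base grammar: only names that CFG.fromstring accepts (no '$').
base = CFG.fromstring("""
NP -> Nom N
N -> 'dog'
""")

# Productions that mention the tag PRP$, built by hand (illustrative rules).
extra = [
    Production(Nonterminal('Nom'), [Nonterminal('PRP$')]),
    Production(Nonterminal('PRP$'), ['my']),
]

# Build a new CFG from the combined list, keeping the original start symbol.
grammar = CFG(base.start(), list(base.productions()) + extra)

# Parse a two-word phrase with the extended grammar.
trees = list(ChartParser(grammar).parse(['my', 'dog']))
for tree in trees:
    print(tree)
```

If the grammar is large, it may be simpler to collect all the hand-built productions first and construct the CFG once at the end.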

Upvotes: 0

alvas

Reputation: 122280

You might need to specify them explicitly as terminal nodes, e.g.:

>>> import nltk
>>> grammar = nltk.CFG.fromstring("""
... S -> NP VP
... VP -> V PUNCT
... PUNCT -> '.'
... V -> 'eat'
... NP -> 'I'
... """)
>>> 
>>> sentence = "I eat .".split()
>>> cp = nltk.ChartParser(grammar)
>>> for tree in cp.parse(sentence):
...     print(tree)
... 
(S (NP I) (VP (V eat) (PUNCT .)))
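
The same trick seems to carry over to the rule in the question. Here is a minimal sketch, assuming a current NLTK that provides nltk.CFG.fromstring. OQ is a hypothetical stand-in name for the `` tag, and the lexical rules and tokens are invented for illustration.

```python
import nltk

# OQ is a hypothetical alphanumeric name standing in for the `` tag.
# The punctuation itself only appears as a quoted terminal.
grammar = nltk.CFG.fromstring("""
PP-CLR -> IN OQ NP-TTL
IN -> 'as'
OQ -> '``'
NP-TTL -> 'Hamlet'
""")

# Illustrative token list; the `` token is kept in the data as-is.
tokens = ['as', '``', 'Hamlet']

parser = nltk.ChartParser(grammar)
trees = list(parser.parse(tokens))
for tree in trees:
    print(tree)
```

With this approach the punctuation tokens can stay in the dataset; only the tag names used on the left-hand side of rules need to be renamed to something the grammar reader accepts.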

Upvotes: 3
