How to parse the special character in Context Free Grammar?

Question

I have a context-free grammar (CFG) that involves punctuation, e.g. nltk.parse_cfg("""PP-CLR -> IN `` NP-TTL""")

The `` is a valid Penn Treebank POS tag, but nltk cannot recognize it. In fact, nltk.parse_cfg cannot recognize any character other than alphanumerics and the dash, while the Penn Treebank POS tag set includes several punctuation tags, such as $ # : . (

Should I keep the punctuation in my dataset, or is there any way to parse these characters?

Thanks

alvas · Accepted Answer

You might need to specify them explicitly as quoted terminal nodes, e.g.:

>>> import nltk
>>> grammar = nltk.parse_cfg("""
... S -> NP VP
... VP -> V PUNCT
... PUNCT -> '.'
... V -> 'eat'
... NP -> 'I'
... """)
>>> 
>>> sentence = "I eat .".split()
>>> cp = nltk.ChartParser(grammar)
>>> for tree in cp.nbest_parse(sentence):
...     print tree
... 
(S (NP I) (VP (V eat) (PUNCT .)))
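
If the punctuation tags also have to appear as unquoted symbols on the right-hand side (like the `` in the question's rule), one possible workaround is to rename them to alphanumeric aliases before building the grammar. Below is a minimal sketch in plain Python. The alias names and the alias_symbols helper are made up for illustration and are not part of nltk, and the table only covers a few of the Penn Treebank punctuation tags:

```python
# Hypothetical alias table: Penn Treebank punctuation tags -> alphanumeric names.
# The names on the right are assumptions chosen for this example.
PUNCT_ALIASES = {
    "``": "LQUOTE",
    ",": "COMMA",
    ".": "PERIOD",
    ":": "COLON",
    "$": "DOLLAR",
    "#": "POUND",
    "(": "LPAREN",
    ")": "RPAREN",
}

def alias_symbols(rule):
    """Replace bare punctuation tags in a whitespace-separated rule with aliases.

    Quoted terminals such as '.' are left alone, because the token then
    includes the quote characters and is not a key in the table.
    """
    return " ".join(PUNCT_ALIASES.get(tok, tok) for tok in rule.split())

print(alias_symbols("PP-CLR -> IN `` NP-TTL"))
# PP-CLR -> IN LQUOTE NP-TTL
```

The rewritten rule contains only alphanumerics and dashes, so it stays within what the question says the grammar parser accepts. The punctuation itself can still be introduced as a quoted terminal, in the same way as PUNCT -> '.' above.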
