Reputation: 70
My problem is: I have many sentences from many documents, and for every sentence I have to write a CFG using NLTK in Python, like this:
grammar1 = nltk.CFG.fromstring("""
S -> NP VP
VP -> V NP | V NP PP
PP -> P NP
V -> "saw" | "ate" | "walked"
NP -> "John" | "Mary" | "Bob" | Det N | Det N PP
Det -> "a" | "an" | "the" | "my"
N -> "man" | "dog" | "cat" | "telescope" | "park"
P -> "in" | "on" | "by" | "with"
""")
Instead of writing a grammar by hand for every sentence, I want to generate it automatically. I'm stuck at this. Please help me overcome it.
Upvotes: 1
Views: 3577
Reputation: 50200
If you have one or more parsed sentences, you can extract a CFG that describes them by calling the method productions() on the parsed sentence object (an nltk.Tree). Here's an example with the first 10 sentences of the Penn Treebank corpus:
>>> ruleset = set(rule for tree in nltk.corpus.treebank.parsed_sents()[:10]
...               for rule in tree.productions())
>>> for rule in ruleset:
...     print(rule)
NP -> PRP
NP -> DT JJ NN
VP -> VBN S
ADVP-TMP -> RB
IN -> 'among'
NNP -> 'Corp.'
NP -> PRP$ NN NN NNS
NP-SBJ -> DT
RRC -> ADVP-TMP VP
NNP -> 'Journal'
VP -> VBN NP
...
The above will give you 278 rules (including vocabulary items) for those 10 sentences, but coverage improves as your sample grows. You can take it from there.
Of course if your sentences aren't parsed yet, you'll first need to parse them.
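As an end-to-end sketch of "taking it from there" — using a single hand-written parse tree instead of the Treebank, so it runs without downloading the corpus — the extracted productions can be collected into an nltk.CFG and used to parse new sentences:

```python
import nltk

# A hand-parsed sentence (a made-up example, not from the Treebank),
# in the bracketed format accepted by nltk.Tree.fromstring().
tree = nltk.Tree.fromstring(
    "(S (NP (Det the) (N dog)) (VP (V saw) (NP (Det a) (N cat))))")

# Extract the CFG rules (including lexical ones) that this tree instantiates.
rules = tree.productions()
for rule in rules:
    print(rule)

# Collect the rules into a grammar and parse a new sentence with it.
grammar = nltk.CFG(nltk.Nonterminal("S"), rules)
parser = nltk.ChartParser(grammar)
for parse in parser.parse("a cat saw the dog".split()):
    print(parse)
```

With many trees you would pool the productions from all of them (as in the set comprehension above) before building the grammar.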
Upvotes: 1