Matthew

Reputation: 79

How to obtain binarized parsing tree from Stanford Parser?

A node in the parse tree produced by the Stanford Parser (for example, with the englishPCFG.ser.gz model) may have more than two children. How can I obtain a binarized parse tree with POS tagging information on each node? Are there any parameters I can pass to the parser to achieve this?

Upvotes: 2

Views: 1363

Answers (1)

John Stewart

Reputation: 353

The trees are not strictly binary-branching because the Penn Treebank, on which the parser was trained, isn't either. This is a theoretical problem with the (now ancient) treebank that continues to bedevil computational linguists!

The way in which I've dealt with this is by writing complex tree-transformation logic that restructures the output of the constituency parser as binary-branching structures, using X-bar-theoretic representations -- in the process promoting functional projections over lexical phrases, raising quantifiers and so on.

[Update] I tried the TreeBinarizer class. It worked well on the one example I used. I'm parsing Spanish, and using Clojure. Here's a sample session:

user=> (import edu.stanford.nlp.parser.lexparser.TreeBinarizer)
edu.stanford.nlp.parser.lexparser.TreeBinarizer
user=> (import edu.stanford.nlp.trees.international.spanish.SpanishTreebankLanguagePack)
edu.stanford.nlp.trees.international.spanish.SpanishTreebankLanguagePack
user=> (import edu.stanford.nlp.trees.international.spanish.SpanishHeadFinder)
edu.stanford.nlp.trees.international.spanish.SpanishHeadFinder
user=> ; I have a parsed tree:

user=> (.pennPrint t)
(sp
  (prep (sp000 a))
  (S
    (infinitiu (vmn0000 decir))
    (S
      (conj (cs que))
      (grup.verb (vaip000 hemos) (vmp0000 visto))
      (sn
        (spec (di0000 un))
        (grup.nom (nc0s000 relámpago))))))
nil
user=> ; let's create a binarizer

user=> (def tb (TreeBinarizer/simpleTreeBinarizer (SpanishHeadFinder.) (SpanishTreebankLanguagePack.)))
#'user/tb
user=> ; now transform the tree above -- note that the second embedded S node has three children

user=> (.pennPrint (.transformTree tb t))
(sp
  (prep (sp000 a))
  (S
    (infinitiu (vmn0000 decir))
    (S
      (conj (cs que))
      (@S
        (grup.verb (vaip000 hemos) (vmp0000 visto))
        (sn
          (spec (di0000 un))
          (grup.nom (nc0s000 relámpago)))))))
nil
user=> ; the binarizer created an intermediate phrasal node @S, pushing the conjunction into <Spec, @S>
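
For intuition, here is a minimal pure-Clojure sketch of the kind of restructuring shown in the session above. It is not how TreeBinarizer is implemented (the real class takes a head finder and binarizes around the head child); this version simply keeps the first child and folds the remaining ones under an @-prefixed node. The vector representation of trees and the names binarize, fold-right and intermediate-label are my own assumptions for illustration, not part of the Stanford API.

```clojure
;; Simplified right-factoring sketch; NOT Stanford's TreeBinarizer.
;; Assumption: a tree is either a string (a word) or a vector [label child ...].

(defn intermediate-label
  "Label for a synthetic node, mirroring the @S naming in the output above."
  [label]
  (if (.startsWith ^String label "@")
    label
    (str "@" label)))

(defn- fold-right
  "Build a node from already-binarized children, nesting any children
  beyond the first under intermediate @-nodes."
  [label children]
  (if (<= (count children) 2)
    (into [label] children)
    [label
     (first children)
     (fold-right (intermediate-label label) (rest children))]))

(defn binarize
  "Return a copy of tree in which no node has more than two children."
  [tree]
  (if (string? tree)
    tree
    (let [[label & children] tree]
      (fold-right label (map binarize children)))))

;; The embedded S node from the example, written as nested vectors.
(def s-node
  ["S"
   ["conj" ["cs" "que"]]
   ["grup.verb" ["vaip000" "hemos"] ["vmp0000" "visto"]]
   ["sn" ["spec" ["di0000" "un"]] ["grup.nom" ["nc0s000" "relámpago"]]]])

(binarize s-node)
```

On this particular node the sketch produces the same shape as the session above: conj stays directly under S, and grup.verb and sn end up under a new @S node. On other trees the real binarizer may pick a different grouping, because it is guided by the head finder.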

Upvotes: 2
