Reputation: 79
A node in a parse tree produced by the Stanford Parser (for example, with the englishPCFG.ser.gz model) may have more than two children. How can I obtain a binarized parse tree with POS tag information on each node? Are there any parameters I can pass to the parser to achieve this?
Upvotes: 2
Views: 1363
Reputation: 353
The trees are not strictly binary-branching because the Penn Treebank, on which the parser was trained, isn't either. This is a theoretical problem with the (now ancient) treebank that continues to bedevil computational linguists!
The way in which I've dealt with this is by writing complex tree-transformation logic that restructures the output of the constituency parser as binary-branching structures, using X-bar-theoretic representations -- in the process promoting functional projections over lexical phrases, raising quantifiers and so on.
[Update] I tried the TreeBinarizer class. It worked well on the one example I used. I'm parsing Spanish, and using Clojure. Here's a sample session:
user=> (import edu.stanford.nlp.parser.lexparser.TreeBinarizer)
edu.stanford.nlp.parser.lexparser.TreeBinarizer
user=> (import edu.stanford.nlp.trees.international.spanish.SpanishTreebankLanguagePack)
edu.stanford.nlp.trees.international.spanish.SpanishTreebankLanguagePack
user=> (import edu.stanford.nlp.trees.international.spanish.SpanishHeadFinder)
edu.stanford.nlp.trees.international.spanish.SpanishHeadFinder
user=> ; I have a parsed tree:
user=> (.pennPrint t)
(sp
  (prep (sp000 a))
  (S
    (infinitiu (vmn0000 decir))
    (S
      (conj (cs que))
      (grup.verb (vaip000 hemos) (vmp0000 visto))
      (sn
        (spec (di0000 un))
        (grup.nom (nc0s000 relámpago))))))
nil
user=> ; let's create a binarizer
user=> (def tb (TreeBinarizer/simpleTreeBinarizer (SpanishHeadFinder.) (SpanishTreebankLanguagePack.)))
#'user/tb
user=> ; now transform the tree above -- note that the second embedded S node has three children
user=> (.pennPrint (.transformTree tb t))
(sp
  (prep (sp000 a))
  (S
    (infinitiu (vmn0000 decir))
    (S
      (conj (cs que))
      (@S
        (grup.verb (vaip000 hemos) (vmp0000 visto))
        (sn
          (spec (di0000 un))
          (grup.nom (nc0s000 relámpago)))))))
nil
user=> ; the binarizer created an intermediate phrasal node @S, pushing the conjunction into <Spec, @S>
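To make the effect of the transformation above easier to see, here is a minimal, self-contained Java sketch of right-branching binarization. It is not Stanford's implementation: TreeBinarizer is head-driven, while this sketch simply keeps the first child and folds the remaining ones under "@"-prefixed intermediate nodes. The `Node` class, the `binarize` method and the `BinarizeSketch` class name are hypothetical and exist only for illustration, and the example tree drops the POS preterminals for brevity.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class BinarizeSketch {

    /** Minimal tree node (hypothetical, not a Stanford class): a label plus children. */
    static final class Node {
        final String label;
        final List<Node> children;

        Node(String label, Node... children) {
            this.label = label;
            this.children = Arrays.asList(children);
        }

        /** Renders the tree in bracketed form on a single line. */
        @Override
        public String toString() {
            if (children.isEmpty()) return label;
            StringBuilder sb = new StringBuilder("(").append(label);
            for (Node c : children) sb.append(' ').append(c);
            return sb.append(')').toString();
        }
    }

    /**
     * Returns a copy of the tree in which no node has more than two children.
     * Assumption: plain right-branching, no head finder involved.
     */
    static Node binarize(Node n) {
        // Binarize the subtrees first.
        List<Node> kids = new ArrayList<>();
        for (Node c : n.children) kids.add(binarize(c));

        if (kids.size() <= 2) {
            return new Node(n.label, kids.toArray(new Node[0]));
        }

        // Intermediate nodes reuse the parent's label with an "@" prefix.
        String mid = n.label.startsWith("@") ? n.label : "@" + n.label;

        // Fold children from the right: [a, b, c, d] becomes a, (@X b (@X c d)).
        Node right = kids.get(kids.size() - 1);
        for (int i = kids.size() - 2; i >= 1; i--) {
            right = new Node(mid, kids.get(i), right);
        }
        return new Node(n.label, kids.get(0), right);
    }

    public static void main(String[] args) {
        // Simplified version of the inner S node from the session above.
        Node s = new Node("S",
                new Node("conj", new Node("que")),
                new Node("grup.verb", new Node("hemos"), new Node("visto")),
                new Node("sn", new Node("un"), new Node("rel\u00e1mpago")));
        System.out.println(binarize(s));
    }
}
```

On this three-child S node the sketch groups grup.verb and sn under a new @S node, which matches the shape of the TreeBinarizer output shown in the session. For other trees the two may differ, since the real binarizer chooses where to split based on the head child.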
Upvotes: 2