xzegga

Reputation: 3149

How to get an NLP parse Tree object from a bracketed parse string with NLTK or spaCy?

I have a sentence, "You could say that they regularly catch a shower , which adds to their exhilaration and joie de vivre.", and I can't manage to get an NLP parse tree like the following example:

(ROOT (S (NP (PRP You)) (VP (MD could) (VP (VB say) (SBAR (IN that) (S (NP (PRP they)) (ADVP (RB regularly)) (VP (VB catch) (NP (NP (DT a) (NN shower)) (, ,) (SBAR (WHNP (WDT which)) (S (VP (VBZ adds) (PP (TO to) (NP (NP (PRP$ their) (NN exhilaration)) (CC and) (NP (FW joie) (FW de) (FW vivre))))))))))))) (. .)))

I want to replicate the solution from this answer, https://stackoverflow.com/a/39320379, but I have the sentence as a string rather than as an NLP tree.

BTW, I am using Python 3.

Upvotes: 3

Views: 2328

Answers (2)

Eugene

Reputation: 1639

I am going to assume there is a good reason why you need the parse tree in that format. spaCy does a great job of dependency parsing with a CNN (convolutional neural network), is production-ready, and is very fast. Note that it produces dependency parses, not the CFG-style (context-free grammar) constituency trees shown in the question. You can do something like the below to see for yourself (and then read the spaCy docs):

import spacy

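# The 'en' shortcut was removed in spaCy 3; load a named model instead
# (install it first with: python -m spacy download en_core_web_sm)
nlp = spacy.load('en_core_web_sm')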

text = 'You could say that they regularly catch a shower , which adds to their exhilaration and joie de vivre.'

for token in nlp(text):
    print(token.dep_, end='\t')
    print(token.idx, end='\t')
    print(token.text, end='\t')
    print(token.tag_, end='\t')
    print(token.head.text, end='\t')
    print(token.head.tag_, end='\t')
    print(token.head.idx, end='\t')
    print(' '.join([w.text for w in token.subtree]), end='\t')
    print(' '.join([w.text for w in token.children]))
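
The attributes printed above are enough to walk the parse recursively. What follows is only a rough sketch, not a spaCy feature: `to_bracketed` and `fake_token` are hypothetical helper names, and the stand-in objects merely imitate the `i`, `text`, `dep_` and `children` attributes of a spaCy token so the example runs without a model. The labels it emits are dependency relations, so the result is not the Penn Treebank constituency tree from the question.

```python
from types import SimpleNamespace


def to_bracketed(token):
    """Return a bracketed string for the dependency subtree rooted at `token`.

    Uses only `token.i`, `token.text`, `token.dep_` and `token.children`.
    """
    # Collect the head word and each child's subtree, keyed by sentence position.
    items = [(token.i, token.text)]
    items += [(child.i, to_bracketed(child)) for child in token.children]
    # Sort by position so the words keep their original order.
    body = " ".join(text for _, text in sorted(items))
    return f"({token.dep_} {body})"


def fake_token(i, text, dep, children=()):
    """Hypothetical stand-in that mimics the spaCy token attributes used above."""
    return SimpleNamespace(i=i, text=text, dep_=dep, children=list(children))


mary = fake_token(0, "Mary", "nsubj")
bob = fake_token(2, "Bob", "dobj")
saw = fake_token(1, "saw", "ROOT", [mary, bob])

bracketed = to_bracketed(saw)
print(bracketed)  # (ROOT (nsubj Mary) saw (dobj Bob))
```

With a real spaCy `doc`, the same function could presumably be called as `to_bracketed(sent.root)` for each `sent` in `doc.sents`. Tokens whose text is itself a parenthesis would need escaping before the string can be read back as a tree.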

You could write an algorithm to walk this tree and print it in the format you want (I couldn't find a ready-made one, sorry, but the output shows the indexes and how to traverse the parse). Another option is to extract a CFG somehow, and then use NLTK to do the parsing and display the result in the format you desire. This example is from the NLTK book (modified to work with Python 3):

import nltk
from nltk import CFG

grammar = CFG.fromstring("""
  S -> NP VP
  VP -> V NP | V NP PP
  V -> "saw" | "ate"
  NP -> "John" | "Mary" | "Bob" | Det N | Det N PP
  Det -> "a" | "an" | "the" | "my"
  N -> "dog" | "cat" | "cookie" | "park"
  PP -> P NP
  P -> "in" | "on" | "by" | "with"
  """)

text = 'Mary saw Bob'

sent = text.split()
rd_parser = nltk.RecursiveDescentParser(grammar)
for p in rd_parser.parse(sent):
    print(p)
# (S (NP Mary) (VP (V saw) (NP Bob)))

However, you can see that you need to define the CFG yourself: if you try your original text in place of the example's, the parser rejects every token that is not defined in the grammar.
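
To make that limitation concrete, here is a small sketch that checks whether a grammar covers a sentence before parsing it. The grammar is a cut-down version of the one above, and `is_covered` is a hypothetical helper name; it relies on `CFG.check_coverage`, which raises `ValueError` when an input word has no lexical rule.

```python
from nltk import CFG

# Cut-down version of the toy grammar above.
grammar = CFG.fromstring("""
  S -> NP VP
  VP -> V NP
  V -> "saw" | "ate"
  NP -> "John" | "Mary" | "Bob"
""")


def is_covered(grammar, tokens):
    """Return True if every token appears as a terminal in the grammar."""
    try:
        grammar.check_coverage(tokens)
    except ValueError:
        # NLTK reports the uncovered words in the exception message.
        return False
    return True


print(is_covered(grammar, "Mary saw Bob".split()))   # True
print(is_covered(grammar, "You could say".split()))  # False
```

The NLTK parsers run the same coverage check at the start of `parse()`, which is why the original sentence fails immediately with a hand-written grammar like this one.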

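It seems the easiest way to obtain your desired format is to use the Stanford Parser. Taken from this SO question (sorry, I haven't tested it):

# Assumes the Stanford Parser jars are installed and visible to NLTK (e.g. via CLASSPATH)
from nltk.parse.stanford import StanfordParser

parser = StanfordParser(model_path='edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz')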
parsed = parser.raw_parse('Jack payed up to 5% more for each unit')
for line in parsed:
    print(line, end=' ') # This will print all in one line, as desired

I didn't test this because I don't have time to install the Stanford Parser, which can be cumbersome compared with installing Python modules (assuming you are looking for a Python solution).

I hope this helps, and I'm sorry that it's not a direct answer.

Upvotes: 2

alvas

Reputation: 122142

Use the Tree.fromstring() method:

>>> from nltk import Tree
>>> parse = Tree.fromstring('(ROOT (S (NP (PRP You)) (VP (MD could) (VP (VB say) (SBAR (IN that) (S (NP (PRP they)) (ADVP (RB regularly)) (VP (VB catch) (NP (NP (DT a) (NN shower)) (, ,) (SBAR (WHNP (WDT which)) (S (VP (VBZ adds) (PP (TO to) (NP (NP (PRP$ their) (NN exhilaration)) (CC and) (NP (FW joie) (FW de) (FW vivre))))))))))))) (. .)))')

>>> parse
Tree('ROOT', [Tree('S', [Tree('NP', [Tree('PRP', ['You'])]), Tree('VP', [Tree('MD', ['could']), Tree('VP', [Tree('VB', ['say']), Tree('SBAR', [Tree('IN', ['that']), Tree('S', [Tree('NP', [Tree('PRP', ['they'])]), Tree('ADVP', [Tree('RB', ['regularly'])]), Tree('VP', [Tree('VB', ['catch']), Tree('NP', [Tree('NP', [Tree('DT', ['a']), Tree('NN', ['shower'])]), Tree(',', [',']), Tree('SBAR', [Tree('WHNP', [Tree('WDT', ['which'])]), Tree('S', [Tree('VP', [Tree('VBZ', ['adds']), Tree('PP', [Tree('TO', ['to']), Tree('NP', [Tree('NP', [Tree('PRP$', ['their']), Tree('NN', ['exhilaration'])]), Tree('CC', ['and']), Tree('NP', [Tree('FW', ['joie']), Tree('FW', ['de']), Tree('FW', ['vivre'])])])])])])])])])])])])]), Tree('.', ['.'])])])

>>> parse.pretty_print()
                                                       ROOT                                                             
                                                        |                                                                
                                                        S                                                               
  ______________________________________________________|_____________________________________________________________   
 |         VP                                                                                                         | 
 |     ____|___                                                                                                       |  
 |    |        VP                                                                                                     | 
 |    |     ___|____                                                                                                  |  
 |    |    |       SBAR                                                                                               | 
 |    |    |    ____|_______                                                                                          |  
 |    |    |   |            S                                                                                         | 
 |    |    |   |     _______|____________                                                                             |  
 |    |    |   |    |       |            VP                                                                           | 
 |    |    |   |    |       |        ____|______________                                                              |  
 |    |    |   |    |       |       |                   NP                                                            | 
 |    |    |   |    |       |       |         __________|__________                                                   |  
 |    |    |   |    |       |       |        |          |         SBAR                                                | 
 |    |    |   |    |       |       |        |          |      ____|____                                              |  
 |    |    |   |    |       |       |        |          |     |         S                                             | 
 |    |    |   |    |       |       |        |          |     |         |                                             |  
 |    |    |   |    |       |       |        |          |     |         VP                                            | 
 |    |    |   |    |       |       |        |          |     |     ____|____                                         |  
 |    |    |   |    |       |       |        |          |     |    |         PP                                       | 
 |    |    |   |    |       |       |        |          |     |    |     ____|_____________________                   |  
 |    |    |   |    |       |       |        |          |     |    |    |                          NP                 | 
 |    |    |   |    |       |       |        |          |     |    |    |          ________________|________          |  
 NP   |    |   |    NP     ADVP     |        NP         |    WHNP  |    |         NP               |        NP        | 
 |    |    |   |    |       |       |     ___|____      |     |    |    |     ____|_______         |    ____|____     |  
PRP   MD   VB  IN  PRP      RB      VB   DT       NN    ,    WDT  VBZ   TO  PRP$          NN       CC  FW   FW   FW   . 
 |    |    |   |    |       |       |    |        |     |     |    |    |    |            |        |   |    |    |    |  
You could say that they regularly catch  a      shower  ,   which adds  to their     exhilaration and joie  de vivre  . 
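
Once the string has been read into a `Tree`, the usual `Tree` methods are available. The following is a small sketch on a shortened version of the bracketed string (shortened only to keep the example compact); the variable names are illustrative.

```python
from nltk import Tree

# Shortened version of the bracketed parse from the question.
bracketed = "(ROOT (S (NP (PRP You)) (VP (MD could) (VP (VB say)))))"
tree = Tree.fromstring(bracketed)

# Words in sentence order.
words = tree.leaves()

# (word, POS tag) pairs taken from the preterminal nodes.
tagged = tree.pos()

# Text of every subtree labelled NP.
noun_phrases = [
    " ".join(subtree.leaves())
    for subtree in tree.subtrees(lambda t: t.label() == "NP")
]

# Back to a single-line bracketed string.
one_line = " ".join(str(tree).split())

print(words)         # ['You', 'could', 'say']
print(tagged)        # [('You', 'PRP'), ('could', 'MD'), ('say', 'VB')]
print(noun_phrases)  # ['You']
print(one_line)
```

The same calls should work unchanged on the full `parse` object above, since it is an ordinary `Tree`.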

Upvotes: 4
