Henley Wing Chiu
Henley Wing Chiu

Reputation: 22535

Regex to extract Noun Phrases from a Part of Speech parse tree?

I am trying to extract all three word noun phrases from a Stanford POS Parse Tree. Basically, anything that looks like:

(NP (TAG WORD) (TAG WORD) (TAG WORD))

Or:

(NP (TAG WORD) (TAG (TAG WORD) (TAG WORD)))

This is what a parse tree can look like:

(ROOT (SQ (VBZ Is) (NP (DT this)) (NP (DT an) (NN asthma) (NN attack)) (. ?)))

When I do this regex, it extracts the correct 3 word noun phrase:

threeWordNounPhrases = full.scan(/\(NP \([^()]+ [^()]+\) \([^()]+ [^()]+\)\)/)
# => "(NP (DT an) (NN asthma) (NN attack))"

However, this does not work for something like:

(ROOT (SQ (NNP Should) (NP (PRP I)) (VP (VB watch) (NP (NP (NNP Game)) (PP (IN of) (NP (NNP Thrones)))) ) (. ?)))

Which should return:

(NP (NP (NNP Game)) (PP (IN of) (NP (NNP Thrones))))

Upvotes: 0

Views: 647

Answers (2)

Amadan
Amadan

Reputation: 198436

Specifically for three words, it is possible, but not pretty. For N words, the complexity of the regexp rises. Note that this is just for fun (and regexp/Oniguruma education); in reality, I'd suggest to go with what everyone else says: use a tree parsing library and manipulate the tree.

str = "(ROOT (SQ (NNP Should) (NP (PRP I)) (VP (VB watch) (NP (NP (NNP Game)) (PP (IN of) (NP (NNP Thrones)))) ) (. ?)))"

re = /
  (?<tag>
    [A-Z]+
  ){0}

  (?<word>
    \( \g<tag> \s
    (?:
      [^()]+ |
      \g<word>
    )
    \)
  ){0}

  (?<word2>
    \g<word> \s \g<word> |
    \( \g<tag> \s \g<word2> \)
  ){0}

  (?<word3>
    \g<word> \s \g<word> \s \g<word> |
    \g<word2> \s \g<word> |
    \g<word> \s \g<word2> |
    \( \g<tag> \s \g<word3> \)
  ){0}
  \( NP \s \g<word3> \)
/x;

puts str[re]
# => (NP (NP (NNP Game)) (PP (IN of) (NP (NNP Thrones))))

Upvotes: 2

Stephen Grimes
Stephen Grimes

Reputation: 397

I don't see a way to use regular expressions unless are able to consider all possible structures. What you did works for simple cases, but as you found, it fails with deeper, nested structures. I see two options:

  1. From the point where you encounter (NP in the text, read in additional characters. Keep a running tally of parenthesis. Add to it when you see (, subtract when you see ). When you get to zero, you've reached the end of the NP.

  2. Parse the tree using rubytree. Extract all subtrees that are dominated by a node with label of NP. Convert the subtree back to string form by concatenating leaf nodes.

Upvotes: 0

Related Questions