Reputation: 22535
I am trying to extract all three word noun phrases from a Stanford POS Parse Tree. Basically, anything that looks like:
(NP (TAG WORD) (TAG WORD) (TAG WORD))
Or:
(NP (TAG WORD) (TAG (TAG WORD) (TAG WORD)))
This is what a parse tree can look like:
(ROOT (SQ (VBZ Is) (NP (DT this)) (NP (DT an) (NN asthma) (NN attack)) (. ?)))
When I do this regex, it extracts the correct 3 word noun phrase:
threeWordNounPhrases = full.scan(/\(NP \([^()]+ [^()]+\) \([^()]+ [^()]+\)\)/)
# => "(NP (DT an) (NN asthma) (NN attack))"
However, this does not work for something like:
(ROOT (SQ (NNP Should) (NP (PRP I)) (VP (VB watch) (NP (NP (NNP Game)) (PP (IN of) (NP (NNP Thrones)))) ) (. ?)))
Which should return:
(NP (NP (NNP Game)) (PP (IN of) (NP (NNP Thrones))))
Upvotes: 0
Views: 647
Reputation: 198436
Specifically for three words, it is possible, but not pretty. For N words, the complexity of the regexp rises. Note that this is just for fun (and regexp/Oniguruma education); in reality, I'd suggest to go with what everyone else says: use a tree parsing library and manipulate the tree.
str = "(ROOT (SQ (NNP Should) (NP (PRP I)) (VP (VB watch) (NP (NP (NNP Game)) (PP (IN of) (NP (NNP Thrones)))) ) (. ?)))"
re = /
(?<tag>
[A-Z]+
){0}
(?<word>
\( \g<tag> \s
(?:
[^()]+ |
\g<word>
)
\)
){0}
(?<word2>
\g<word> \s \g<word> |
\( \g<tag> \s \g<word2> \)
){0}
(?<word3>
\g<word> \s \g<word> \s \g<word> |
\g<word2> \s \g<word> |
\g<word> \s \g<word2> |
\( \g<tag> \s \g<word3> \)
){0}
\( NP \s \g<word3> \)
/x;
puts str[re]
# => (NP (NP (NNP Game)) (PP (IN of) (NP (NNP Thrones))))
Upvotes: 2
Reputation: 397
I don't see a way to use regular expressions unless are able to consider all possible structures. What you did works for simple cases, but as you found, it fails with deeper, nested structures. I see two options:
From the point where you encounter (NP
in the text, read in additional characters. Keep a running tally of parenthesis. Add to it when you see (
, subtract when you see )
. When you get to zero, you've reached the end of the NP
.
Parse the tree using rubytree. Extract all subtrees that are dominated by a node with label of NP
. Convert the subtree back to string form by concatenating leaf nodes.
Upvotes: 0