Reputation: 583
In the following POS-tagged sentence (and similar sentences), what regular expression should I use to capture only two-word noun-noun compounds (i.e. \p{Alnum}+_NN[PS]? \p{Alnum}+_NN[PS]?) and avoid capturing two-word matches that are part of larger phrases?
I_PRP will_MD never_RB go_VB to_IN sun_NN devil_NN auto_NN again_RB but_CC my_PRP$ family_NN members_NNS will_MD ._.
In particular, I would like to capture family_NN members_NNS but not sun_NN devil_NN or devil_NN auto_NN.
Currently I use the following regex with positive lookahead:
"(?=\\b([\\p{Alnum}]+)_(NN[SP]?)\\s([\\p{Alnum}]+)_(NN[SP]?)\\b)."
The problem is that, in addition to family_NN members_NNS, it also captures sun_NN devil_NN and devil_NN auto_NN.
Upvotes: 1
Views: 63
Reputation: 121710
You need both a lookahead and a lookbehind here.
Basically, you want, for some pattern P, that PP is matched if and only if there is no P immediately before or after it.
Crude way, with the lookahead and lookbehind operators:
(?<!P)PP(?!P)
The (?<!...) and (?!...) constructs are, respectively, the negative lookbehind and negative lookahead operators, where ... stands for a regex.
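To make the idea concrete, here is a minimal sketch with a toy pattern, where P is just the single character a, so PP is "aa". The class and method names are made up for illustration and are not part of any library.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class IsolatedPairDemo {
    // Toy case: P is the single character 'a', so (?<!P)PP(?!P) becomes this.
    static final Pattern ISOLATED_PAIR = Pattern.compile("(?<!a)aa(?!a)");

    // Returns the start index of every "aa" that is not part of a longer run of 'a'.
    static List<Integer> isolatedPairStarts(String input) {
        List<Integer> starts = new ArrayList<>();
        Matcher m = ISOLATED_PAIR.matcher(input);
        while (m.find()) {
            starts.add(m.start());
        }
        return starts;
    }

    public static void main(String[] args) {
        System.out.println(isolatedPairStarts("baab"));        // [1]
        System.out.println(isolatedPairStarts("baaab"));       // []
        System.out.println(isolatedPairStarts("aa b aaa aa")); // [0, 9]
    }
}
```

The run of three a's is rejected twice: the lookahead rules out the first two, and the lookbehind rules out the last two.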
If we take P to be:
[\p{Alnum}]+_NN[PS]?
then one sketch of a solution, allowing for whitespace between tokens, would look like:
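import java.util.regex.Pattern;
// Java look-behind needs a bounded length, so it checks only the previous tag
// (followed by a single whitespace character) instead of the whole of P.
private static final String P = "[\\p{Alnum}]+_NN[PS]?";
private static final String RE = "(?<!_NN[PS]?\\s)\\b"
+ "(" + P + "\\s+" + P + ")\\b(?!\\s+" + P + "\\b)";
private static final Pattern PATTERN = Pattern.compile(RE);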
This is only a sketch however.
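For instance, a self-contained variant that can be run against the sentence from the question might look like the following. It rests on a few assumptions: tokens are separated by a single whitespace character wherever the lookbehind looks, the lookbehind inspects only the previous tag (Java requires lookbehind to have a bounded length), and the names `CompoundFinder` and `findCompounds` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CompoundFinder {
    // One noun token, e.g. "family_NN" or "members_NNS".
    static final String P = "\\p{Alnum}+_NN[PS]?";

    // Assumption: the look-behind checks only the previous tag plus one
    // whitespace character, which keeps its length bounded.
    static final Pattern TWO_NOUNS = Pattern.compile(
            "(?<!_NN[PS]?\\s)\\b(" + P + "\\s+" + P + ")\\b(?!\\s+" + P + "\\b)");

    // Returns every two-noun sequence that has no noun token directly
    // before or after it.
    static List<String> findCompounds(String tagged) {
        List<String> result = new ArrayList<>();
        Matcher m = TWO_NOUNS.matcher(tagged);
        while (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        String sentence = "I_PRP will_MD never_RB go_VB to_IN sun_NN devil_NN auto_NN "
                + "again_RB but_CC my_PRP$ family_NN members_NNS will_MD ._.";
        System.out.println(findCompounds(sentence)); // [family_NN members_NNS]
    }
}
```

Unlike a version that requires whitespace on both sides of the match, the word boundaries here also allow a compound at the very start or end of the input.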
Given the complexity of the input, you will probably want to do more, so I am not sure that regexes are the tool you are really looking for in the end.
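If regexes do turn out to be too brittle, one alternative is to split the input into tokens and keep only the maximal runs of noun-tagged tokens whose length is exactly two. A rough sketch, assuming whitespace-separated word_TAG tokens; the class and method names are made up for illustration:

```java
import java.util.ArrayList;
import java.util.List;

class NounRunScanner {
    // True for tokens tagged NN, NNS or NNP (the tags the question's regex accepts).
    static boolean isNoun(String token) {
        return token.matches("\\p{Alnum}+_NN[PS]?");
    }

    // Collects every maximal run of consecutive noun tokens of length exactly two.
    static List<String> twoWordCompounds(String tagged) {
        String[] tokens = tagged.trim().split("\\s+");
        List<String> result = new ArrayList<>();
        int i = 0;
        while (i < tokens.length) {
            if (!isNoun(tokens[i])) {
                i++;
                continue;
            }
            int start = i;
            // Advance to the end of the current run of noun tokens.
            while (i < tokens.length && isNoun(tokens[i])) {
                i++;
            }
            if (i - start == 2) {
                result.add(tokens[start] + " " + tokens[start + 1]);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String sentence = "I_PRP will_MD never_RB go_VB to_IN sun_NN devil_NN auto_NN "
                + "again_RB but_CC my_PRP$ family_NN members_NNS will_MD ._.";
        System.out.println(twoWordCompounds(sentence)); // [family_NN members_NNS]
    }
}
```

Working on tokens makes the "exactly two" condition an explicit length check, which is easier to extend (for example to other tags or run lengths) than a pair of lookarounds.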
Upvotes: 1