MAZDAK

Reputation: 583

How to avoid capturing overlapping patterns while using Java regex?

In the following POS-tagged sentence (and similar sentences), what regular expression should I use to capture only two-word noun-noun compounds (i.e. \p{Alnum}+_NN[PS]? \p{Alnum}+_NN[PS]?) and avoid capturing two-word matches that are part of larger phrases?

I_PRP will_MD never_RB go_VB to_IN sun_NN devil_NN auto_NN again_RB but_CC my_PRP$ family_NN members_NNS will_MD ._.

In particular, I would like to capture family_NN members_NNS but not sun_NN devil_NN or devil_NN auto_NN.

Currently I use the following regex with positive lookahead:

"(?=\\b([\\p{Alnum}]+)_(NN[SP]?)\\s([\\p{Alnum}]+)_(NN[SP]?)\\b)."

The problem is that, in addition to family_NN members_NNS, it also captures sun_NN devil_NN and devil_NN auto_NN.

Upvotes: 1

Views: 63

Answers (1)

fge

Reputation: 121710

You need both a lookahead and a lookbehind here.

Basically, you want, for some pattern P, that PP is matched if and only if there is not a P before or after it.

Crude way, with the lookahead and lookbehind operators:

(?<!P)PP(?!P)

The (?<!...) and (?!...) constructs are the negative lookbehind and negative lookahead assertions, respectively; ... stands for the subpattern that must not match at that position.
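To see the (?<!P)PP(?!P) shape in isolation, here is a minimal sketch that takes P to be the single character x. That choice of P, and the class and method names, are only assumptions for illustration; they have nothing to do with the tagged input yet.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class IsolatedPairDemo {
    // Toy stand-in for P: the single character 'x' (an assumption for illustration).
    private static final Pattern ISOLATED_XX = Pattern.compile("(?<!x)xx(?!x)");

    /** Returns the start offset of every "xx" with no 'x' directly before or after it. */
    public static List<Integer> isolatedPairOffsets(String input) {
        List<Integer> offsets = new ArrayList<>();
        Matcher m = ISOLATED_XX.matcher(input);
        while (m.find()) {
            offsets.add(m.start());
        }
        return offsets;
    }

    public static void main(String[] args) {
        // Only the leading "xx" is isolated; "xxx" and "xxxx" are runs longer than two.
        System.out.println(isolatedPairOffsets("xx xxx x xxxx"));
    }
}
```

Because the lookbehind here is a single character, its length is trivially bounded, which is what Java's lookbehind needs.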

If we take P to be:

[\p{Alnum}]+_NN[PS]?

then one sketch of a solution, allowing for whitespace between tokens, would look like:

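import java.util.regex.Pattern;
// P: one noun token such as sun_NN or members_NNS (the class name \p{Alnum} is case-sensitive)
private static final String P = "[\\p{Alnum}]+_NN[PS]?";
// No P before, exactly two P captured in group 1, no P after.
// Caveat: P contains an unbounded +, which Java 8 and earlier reject inside a lookbehind.
private static final String RE = "(?<!" + P + ")"
    + "\\s+(" + P + "\\s+" + P + ")\\s+(?!" + P + ")";
private static final Pattern PATTERN = Pattern.compile(RE);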

This is only a sketch however.
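One possible way to firm this up, as a hedged variant rather than a definitive answer: look behind only for the noun tag plus its separator (at most five characters), so the lookbehind has a bounded length. The sketch assumes tokens are separated by exactly one whitespace character; the class name NounPairFinder and the method findIsolatedPairs are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NounPairFinder {
    // One tagged noun token, e.g. sun_NN or members_NNS.
    private static final String NOUN = "\\p{Alnum}+_NN[PS]?";

    private static final Pattern ISOLATED_PAIR = Pattern.compile(
        "(?<!_NN[PS]?\\s)"                       // previous token is not a noun (bounded lookbehind)
        + "\\b(" + NOUN + "\\s" + NOUN + ")\\b"  // exactly two noun tokens, captured in group 1
        + "(?!\\s" + NOUN + "\\b)");             // next token is not a noun

    /** Returns every two-token noun compound that is not part of a longer noun run. */
    public static List<String> findIsolatedPairs(String taggedSentence) {
        List<String> pairs = new ArrayList<>();
        Matcher m = ISOLATED_PAIR.matcher(taggedSentence);
        while (m.find()) {
            pairs.add(m.group(1));
        }
        return pairs;
    }

    public static void main(String[] args) {
        String sentence = "I_PRP will_MD never_RB go_VB to_IN sun_NN devil_NN auto_NN "
            + "again_RB but_CC my_PRP$ family_NN members_NNS will_MD ._.";
        System.out.println(findIsolatedPairs(sentence)); // [family_NN members_NNS]
    }
}
```

On the sentence from the question, sun_NN devil_NN is rejected by the lookahead (auto_NN follows) and devil_NN auto_NN by the lookbehind (sun_NN precedes), which leaves only family_NN members_NNS.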

Given the complexity of the input, you probably want to do more, so I am not sure that regexes are the tool you are really looking for in the end.
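If you do step away from a single regex, one alternative (a different technique from the lookaround pattern, sketched here under the assumption that tokens are whitespace-separated) is to split the sentence into tokens, collect runs of consecutive noun tokens, and keep only the runs of length two. The names NounRunScanner and twoWordCompounds are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class NounRunScanner {
    // A whole token counts as a noun if it looks like word_NN, word_NNS or word_NNP.
    private static final Pattern NOUN_TOKEN = Pattern.compile("\\p{Alnum}+_NN[PS]?");

    /** Returns the noun runs that consist of exactly two consecutive tokens. */
    public static List<String> twoWordCompounds(String taggedSentence) {
        List<String> compounds = new ArrayList<>();
        List<String> run = new ArrayList<>();
        for (String token : taggedSentence.trim().split("\\s+")) {
            if (NOUN_TOKEN.matcher(token).matches()) {
                run.add(token);          // extend the current run of nouns
            } else {
                flush(run, compounds);   // a non-noun token ends the run
            }
        }
        flush(run, compounds);           // the sentence may end inside a run
        return compounds;
    }

    // Keeps the run only if it has exactly two tokens, then resets it.
    private static void flush(List<String> run, List<String> out) {
        if (run.size() == 2) {
            out.add(run.get(0) + " " + run.get(1));
        }
        run.clear();
    }

    public static void main(String[] args) {
        String sentence = "I_PRP will_MD never_RB go_VB to_IN sun_NN devil_NN auto_NN "
            + "again_RB but_CC my_PRP$ family_NN members_NNS will_MD ._.";
        System.out.println(twoWordCompounds(sentence)); // [family_NN members_NNS]
    }
}
```

This version is easier to extend (for example to three-word compounds or other tag sets) because the run length is an ordinary integer check rather than something encoded in lookarounds.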

Upvotes: 1
