tbraun89

Reputation: 2234

Match regex groups to a list in Java (Hearst pattern)

I'm trying to match Hearst patterns with a Java regex. This is my regex:

<np>(\w+)<\/np> such as (?:(?:, | or | and )?<np>(\w+)<\/np>)*

If I have an annotated sentence like:

I have a <np>car</np> such as <np>BMW</np>, <np>Audi</np> or <np>Mercedes</np> and this can drive fast.

I want to get the groups:

1. car
2. [BMW, Audi, Mercedes]

UPDATE: Here is my current Java code:

Pattern pattern = Pattern.compile("<np>(\\w+)<\\/np> such as (?:(?:, | or | and )?<np>(\\w+)<\\/np>)*");
Matcher matcher = pattern.matcher("I have a <np>car</np> such as <np>BMW</np>, <np>Audi</np> or <np>Mercedes</np> and this can drive fast.");

while (matcher.find()) {
    System.out.println(matcher.group(1));
    System.out.println(matcher.group(2));
}

But the second group only contains Mercedes. How can I get all the matches for the second group (maybe as an array)? Is this possible with Java's Pattern and Matcher? And if so, what is my mistake?

Upvotes: 4

Views: 1314

Answers (1)

Casimir et Hippolyte

Reputation: 89629

If you want to be sure the results are contiguous, you can use the \G anchor, which forces a match to start exactly where the previous match ended:

Pattern p = Pattern.compile("<np>(\\w+)</np> such as|\\G(?:,| or| and)? <np>(\\w+)</np>");
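
As a minimal sketch of how this pattern might be consumed: a repeated capturing group keeps only its last capture, which is why group 2 held just Mercedes in the question. Here each noun phrase is instead matched by its own find() call. The class name HearstExample and the method extract are hypothetical names chosen for illustration; the pattern and the sentence are the ones from this thread.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class for illustration, not part of any library.
class HearstExample {
    // Alternative 1 captures the hypernym in group 1.
    // Alternative 2 captures one hyponym per match in group 2 and, because of \G,
    // can only match directly after the previous match.
    static final Pattern P = Pattern.compile(
        "<np>(\\w+)</np> such as|\\G(?:,| or| and)? <np>(\\w+)</np>");

    // Maps each hypernym to the list of hyponyms that follow it.
    static Map<String, List<String>> extract(String text) {
        Map<String, List<String>> result = new LinkedHashMap<>();
        List<String> current = null;
        Matcher m = P.matcher(text);
        while (m.find()) {
            if (m.group(1) != null) {
                // New "X such as" head: start (or reuse) its hyponym list.
                current = result.computeIfAbsent(m.group(1), k -> new ArrayList<>());
            } else if (current != null) {
                // Contiguous continuation: one more hyponym for the current head.
                current.add(m.group(2));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String s = "I have a <np>car</np> such as <np>BMW</np>, <np>Audi</np>"
                 + " or <np>Mercedes</np> and this can drive fast.";
        System.out.println(extract(s)); // prints {car=[BMW, Audi, Mercedes]}
    }
}
```

The null check on current is there because the second alternative can also match at the very start of the input, before any hypernym has been seen.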

Note: the \G anchor matches at the end of the previous match or at the start of the string. To avoid matching at the start of the string, you can add the lookbehind (?<!^) after the \G.
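
A small sketch of what that lookbehind changes, assuming an input that happens to begin with a space and an annotated noun phrase. The class name StartAnchorGuardExample, the method hyponyms, and the test sentence are hypothetical and only serve the illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class for illustration, not part of any library.
class StartAnchorGuardExample {
    // Pattern from the answer: \G may also match at the start of the input.
    static final String UNGUARDED =
        "<np>(\\w+)</np> such as|\\G(?:,| or| and)? <np>(\\w+)</np>";
    // Same pattern with (?<!^) after \G, which rejects the start of the input.
    static final String GUARDED =
        "<np>(\\w+)</np> such as|\\G(?<!^)(?:,| or| and)? <np>(\\w+)</np>";

    // Collects every group-2 capture produced by successive find() calls.
    static List<String> hyponyms(String regex, String text) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            if (m.group(2) != null) {
                found.add(m.group(2));
            }
        }
        return found;
    }

    public static void main(String[] args) {
        String s = " <np>Audi</np> is fast."; // no "such as" anywhere
        System.out.println(hyponyms(UNGUARDED, s)); // prints [Audi]
        System.out.println(hyponyms(GUARDED, s));   // prints []
    }
}
```

On the sentence from the question, both variants should yield the same hyponyms, since there the second alternative only ever matches after a previous match.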

Upvotes: 2
