Mihail Burduja

Reputation: 3256

Java regex skipping matches

I have some text; I want to extract pairs of words that are not separated by punctuation. This is the code:

//n-grams
Pattern p = Pattern.compile("[a-z]+");
if (n == 2) {
    p = Pattern.compile("[a-z]+ [a-z]+");
}
if (n == 3) {
    p = Pattern.compile("[a-z]+ [a-z]+ [a-z]+");
}
Matcher m = p.matcher(text.toLowerCase());
ArrayList<String> result = new ArrayList<String>();

while (m.find()) {
    String temporary = m.group();
    System.out.println(temporary);

    result.add(temporary);
}

The problem is that it skips some matches. For example, "My name is James" with n = 3 must match "my name is" and "name is james", but instead it matches only the first. Is there a way to solve this?

Upvotes: 2

Views: 1478

Answers (3)

felixgaal

Reputation: 2423

I tend to use the argument to the find() method of Matcher:

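// p should start with \b; otherwise a restarted search can begin mid-word
Matcher m = p.matcher(text);
int position = 0;
// find(int) resets the matcher and searches again from the given index
while (m.find(position)) {
  String temporary = m.group();
  System.out.println(m.start() + ":" + temporary);
  // restart one character after this match began, so matches may overlap
  position = m.start() + 1;
}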

So after each iteration, it searches again based on the last start index.
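Here is a minimal, self-contained sketch of that loop applied to the three-word case from the question. The class name OverlappingFind and the method trigrams are made up for illustration, and the leading \b in the pattern is my own addition: without it, a search restarted at index 1 would happily match "y name is".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OverlappingFind {
    // Hypothetical helper: collects overlapping three-word matches by
    // restarting the search one character after the previous match began.
    static List<String> trigrams(String text) {
        String lower = text.toLowerCase();
        // Assumption: the leading \b is added so that a restarted search
        // cannot begin in the middle of a word.
        Pattern p = Pattern.compile("\\b[a-z]+ [a-z]+ [a-z]+");
        Matcher m = p.matcher(lower);
        List<String> result = new ArrayList<>();
        int position = 0;
        // find(int) throws if the index exceeds the input length, so guard it.
        while (position <= lower.length() && m.find(position)) {
            result.add(m.group());
            position = m.start() + 1;
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(trigrams("My name is James"));
        // prints [my name is, name is james]
    }
}
```

Advancing by one character is simple but rescans a lot of text; jumping to the end of the first word of each match would probably do the same job with fewer attempts.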

Hope that helped!

Upvotes: 1

Anirudha

Reputation: 32797

You can capture it using a group inside a lookahead:

(?=(\b[a-z]+\b \b[a-z]+\b \b[a-z]+\b))

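The lookahead consumes no characters, so successive matches can overlap. The pattern has a single capturing group, and each match exposes its text through group 1. So in your case it would be

Match 1, group 1 -> my name is

Match 2, group 1 -> name is james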
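In Java this could look roughly like the sketch below. The class LookaheadNGrams and the method nGrams are hypothetical names, and building the pattern for an arbitrary n is my own generalization of the regex above, not something from the question. The key detail is reading m.group(1), since the overall match (group 0) is empty.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaheadNGrams {
    // Hypothetical helper: returns all overlapping runs of n lowercase
    // words separated by single spaces.
    static List<String> nGrams(String text, int n) {
        // Repeat the word pattern n times, joined by a single space.
        String words = String.join(" ", Collections.nCopies(n, "\\b[a-z]+\\b"));
        // Wrap it in a lookahead so that matching consumes no characters.
        Pattern p = Pattern.compile("(?=(" + words + "))");
        Matcher m = p.matcher(text.toLowerCase());
        List<String> result = new ArrayList<>();
        while (m.find()) {
            // group() would be the empty string here; group(1) holds the n-gram.
            result.add(m.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(nGrams("My name is James", 3));
        // prints [my name is, name is james]
    }
}
```

Because the words must be separated by exactly one space, a comma or full stop between two words still breaks the n-gram, which is what the question asks for.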

Upvotes: 4

Pankaj

Reputation: 5250

A regular expression is applied to the string from left to right, and once a character has been consumed by a match it can't be reused in a later match.

For example, the regex "121" matches "31212142121" only twice, as "_121____121".
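A quick way to see this is to print the start index of every match. The sketch below uses made-up names (NonOverlapping, matchStarts); the second call wraps the same pattern in a lookahead purely for contrast, to show the overlapping occurrence that the plain pattern skips.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NonOverlapping {
    // Hypothetical helper: start index of every match that find() reports.
    static List<Integer> matchStarts(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        List<Integer> starts = new ArrayList<>();
        while (m.find()) {
            starts.add(m.start());
        }
        return starts;
    }

    public static void main(String[] args) {
        // Plain pattern: the match at index 1 consumes indices 1..3,
        // so the occurrence starting at index 3 is never reported.
        System.out.println(matchStarts("121", "31212142121"));     // prints [1, 8]
        // Lookahead version: nothing is consumed, so overlaps show up.
        System.out.println(matchStarts("(?=121)", "31212142121")); // prints [1, 3, 8]
    }
}
```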

Upvotes: 1
