foglerek

Reputation: 178

Matching one string multiple times using regex in Java

I'm having some issues with making the following regex work. I would like the following string:

"Please enter your name here"

to result in an array with the following elements:

'Please enter', 'enter your', 'your name', 'name here'

Currently, I'm using the following pattern, and then creating a matcher and iterating in the following way:

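List<String> wordList = new ArrayList<>();
Pattern word = Pattern.compile("[\\w]+ [\\w]+"); // backslashes must be doubled in a Java string literal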
Matcher m = word.matcher("Please enter your name here");

while (m.find()) {
    wordList.add(m.group());
}

But the result I'm getting is:

'Please enter', 'your name'

What am I doing wrong? (P.S. I checked the same regex on regexpal.com and had the same problem.) It seems like the same word won't be matched twice. What can I do to achieve the result I want?

Thanks.

---------------------------------

EDIT: Thanks for all the suggestions! I ended up doing this, because it makes it easy to specify the maximum n-gram size:

int nGrams = 2;
List<String> wordList = new ArrayList<>();
String patternTpl = "\\b[\\w']+\\b";
String concatString = "what is your age? please enter your name.";
for (int i = 0; i < nGrams; i++) {
    // Create pattern.
    String pattern = patternTpl;
    for (int j = 0; j < i; j++) {
        pattern = pattern + " " + patternTpl;
    }
    pattern = "(?=(" + pattern + "))";
    Pattern word = Pattern.compile(pattern);
    Matcher m = word.matcher(concatString);

    // Iterate over all words and populate wordList
    while (m.find()) {
        wordList.add(m.group(1));
    }
}

This results in:

Pattern: 
(?=(\b[\w']+\b)) // In the first iteration
(?=(\b[\w']+\b \b[\w']+\b)) // In the second iteration

Array:
[what, is, your, age, please, enter, your, name, what is, is your, your age, please enter, enter your, your name]

Note: I got the pattern from the top answer to "Java regex skipping matches".

Upvotes: 5

Views: 13195

Answers (4)

Josh M

Reputation: 11947

If you want to avoid such a specific regex, you could try a simpler solution:

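import java.util.Arrays;

// Joins each word with the word that follows it.
public static String[] array(final String string){
    final String[] words = string.split(" ");
    // Guard against inputs that split into no words at all (e.g. a single space).
    final String[] array = new String[Math.max(0, words.length-1)];
    for(int i = 0; i < words.length-1; i++)
        array[i] = String.format("%s %s", words[i], words[i+1]);
    return array;
}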

public static void main(String args[]){
    final String[] array = array("Please enter your name here");
    System.out.println(Arrays.toString(array));
}

The output is:

[Please enter, enter your, your name, name here]

Upvotes: 1

ajb

Reputation: 31699

Something like:

Pattern word = Pattern.compile("(\\w+) ?");
Matcher m = word.matcher("Please enter your name here");

String previous = null;
while (m.find()) {
    if (previous != null)
        wordList.add(previous + m.group(1));
    previous = m.group();
}

The pattern ends with an optional space, which matches whenever a space follows the word. m.group() returns the entire match, including that trailing space; m.group(1) returns just the word, without the space.
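
To make the difference between the two calls concrete, here is a minimal sketch that records both values for every match. The class name GroupDemo and the helper describe are made up for this illustration and are not part of the answer above; the brackets are only there to make the trailing space visible.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupDemo {
    // Hypothetical helper: returns "[whole match|group 1]" for each match.
    static List<String> describe(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile("(\\w+) ?").matcher(input);
        while (m.find()) {
            // group() is the full match (word plus optional space),
            // group(1) is only the captured word.
            out.add("[" + m.group() + "|" + m.group(1) + "]");
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [[name |name], [here|here]]
        System.out.println(describe("name here"));
    }
}
```

The first entry keeps the space after "name" in the whole match but not in group 1, which is why concatenating the previous whole match with the current group 1 yields a correctly spaced pair.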

Upvotes: 0

arshajii

Reputation: 129557

The matches can't overlap, which explains your result. Here's a potential workaround, making use of capturing groups with a positive lookahead:

Pattern word = Pattern.compile("(\\w+)(?=(\\s\\w+))");
Matcher m = word.matcher("Please enter your name here");

while (m.find()) {
    System.out.println(m.group(1) + m.group(2));
}

Output:

Please enter
enter your
your name
name here
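
As a rough illustration of why the lookahead helps, the sketch below prints the start and end offsets of every match for both the original consuming pattern and a lookahead variant. The class name LookaheadSpans and the helper spans are invented for this example. The lookahead is zero-width, so each match consumes only the first word and the next search can start at the following word.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaheadSpans {
    // Hypothetical helper: returns "start-end:text" for each match,
    // so the span of input consumed by every match is visible.
    static List<String> spans(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.start() + "-" + m.end() + ":" + m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        String input = "Please enter your name here";
        // Consuming pattern: each match swallows two words.
        // prints [0-12:Please enter, 13-22:your name]
        System.out.println(spans("\\w+ \\w+", input));
        // Lookahead pattern: each match consumes only the first word.
        // prints [0-6:Please, 7-12:enter, 13-17:your, 18-22:name]
        System.out.println(spans("\\w+(?=\\s\\w+)", input));
    }
}
```

With the consuming pattern, the second search starts at offset 12, after "enter", so "enter your" can never be reported. With the lookahead, the search resumes right after each first word.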

Upvotes: 9

willeM_ Van Onsem

Reputation: 477598

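You're not doing anything wrong. That is simply how regex matching works: successive matches never overlap, because each search resumes where the previous match ended. Reporting overlapping matches as well would make matching O(n^2), which does not fit the linear-time scan that regex matching is designed around.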

In this case you could simply search for [\w]+ and post-process the matches, for example by joining each word with the one that follows it.
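
One possible way to do that post-processing is sketched below: collect the single-word matches first, then pair each word with its successor. The class name WordPairs and the method pairs are assumptions made for this example, not something from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordPairs {
    // Hypothetical helper: returns every pair of adjacent words in the input.
    static List<String> pairs(String input) {
        // Step 1: collect the individual words with a plain \w+ pattern.
        List<String> words = new ArrayList<>();
        Matcher m = Pattern.compile("\\w+").matcher(input);
        while (m.find()) {
            words.add(m.group());
        }
        // Step 2: join each word with the one that follows it.
        List<String> result = new ArrayList<>();
        for (int i = 0; i + 1 < words.size(); i++) {
            result.add(words.get(i) + " " + words.get(i + 1));
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [Please enter, enter your, your name, name here]
        System.out.println(pairs("Please enter your name here"));
    }
}
```

Because the pairing happens outside the regex, punctuation and repeated whitespace between words are skipped, and the pairs are always joined with a single space.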

Upvotes: 0
