rakoonise

Reputation: 539

How can I find overlapping sets of words with a regex?

Right now I have a regex that looks like "\\w+ \\w+" to find 2-word phrases; however, the matches do not overlap. For example, if my sentence were "The dog ran inside", the output would show "The dog", "ran inside" when I need it to show "The dog", "dog ran", "ran inside". I know there's a way to do this, but I'm too new to regexes to know how.

Thanks!

Upvotes: 2

Views: 620

Answers (4)

Tim Pietzcker

Reputation: 336198

You can do this with a lookahead, a capturing group and a word boundary anchor:

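// Needs java.util.regex.Pattern, java.util.regex.Matcher, java.util.List and java.util.ArrayList.
List<String> matchList = new ArrayList<>();
// \b anchors at a word boundary; the lookahead captures two words without consuming them,
// so the next search can start at the following word and overlap with this one.
Pattern regex = Pattern.compile("\\b(?=(\\w+ \\w+))");
Matcher regexMatcher = regex.matcher(subjectString);
while (regexMatcher.find()) {
    matchList.add(regexMatcher.group(1));
}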

Upvotes: 1

cl-r

Reputation: 1264

The easy way (and the faster one for large strings) is to use split:

    final String[] arrStr = "The dog ran inside".split(" ");
    for (int i = 0, n = arrStr.length - 1; i < n; i++) {
        System.out.format("%s %s%n", arrStr[i], arrStr[i + 1]);
    }

Output:

The dog
dog ran
ran inside

I found no trick for doing this with a regex.

Upvotes: 0

ikegami

Reputation: 385897

Use a lookahead to get the second word, then concatenate the consumed part of the match with the lookahead part.

# This is Perl. The important bits:
#
# $1 is what the first parens captured.
# $2 is what the second parens captured.
# . is the concatenation operator (like Java's "+").

while (/(\w+)(?=(\s+\w+))/g) {
   my $phrase = $1 . $2; 
   ...
}

Sorry, I don't know enough Java, but this should be easy enough to do in Java too.
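For reference, here is a rough Java sketch of the same idea, assuming java.util.regex. The class name OverlappingPairs and the method name wordPairs are made up for illustration and are not part of the answer above. Group 1 is the consumed word and group 2 is what the lookahead captured, including the whitespace before the second word.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class OverlappingPairs {
    // Group 1 consumes one word; the lookahead captures the whitespace and
    // the next word as group 2 without consuming them.
    private static final Pattern PAIR = Pattern.compile("(\\w+)(?=(\\s+\\w+))");

    // Hypothetical helper: returns every pair of adjacent words in text.
    static List<String> wordPairs(String text) {
        List<String> pairs = new ArrayList<>();
        Matcher m = PAIR.matcher(text);
        while (m.find()) {
            // Java's + plays the role of Perl's . operator here.
            pairs.add(m.group(1) + m.group(2));
        }
        return pairs;
    }

    public static void main(String[] args) {
        // Prints [The dog, dog ran, ran inside]
        System.out.println(wordPairs("The dog ran inside"));
    }
}
```

Because only the first word is consumed, the next find() starts at the whitespace after it, so the second word of one pair becomes the first word of the next.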

Upvotes: 0

midgetspy

Reputation: 689

A regex cannot consume the same characters twice ("dog" can't be part of two separate matches), so a plain consuming pattern will not return overlapping phrases. Something like this doesn't need regex at all: you can simply split the string on spaces and combine the words however you like:

>>> words = "The dog ran inside".split(" ")
>>> [" ".join(words[i:i+2]) for i in range(len(words)-1)]
['The dog', 'dog ran', 'ran inside']

If that doesn't solve your problem, please provide more details about what exactly you're trying to accomplish.

Upvotes: 0
