user3422317

Reputation: 27

Perl regular expression matching repeated words

I need a regular expression that matches any line of input in which the same word is repeated two or more times in a row. Assume there is exactly one space between consecutive words.

if ($line !~ m/(\b(\w+)\b\s){2,}/) {
    print "No match\n";
} else {
    print "$`";       # print the part of the string before the match
    print "<$&>";     # highlight the matching part
    print "$'";       # print the rest
}

This is the best I have so far, but something is wrong. Here is how I read it; correct me if I am mistaken:

\b start at a word boundary

(\w+) followed by one or more word characters

\b end at a word boundary

\s then a whitespace character

{2,} check whether this whole group repeats 2 or more times

What's wrong with my expression?

Upvotes: 1

Views: 4004

Answers (3)

MothraDactyl

Reputation: 193

I tried CAustin's answer on regexr.com and the results were not what I expected. Also, there is no need for all the non-capturing groups.

My regex:

(\b(\w+))( \2)+

Word-boundary, followed by (1 or more word characters)[group 2], followed by one or more of: space, group 2.

This next one replaces the space with \s+, generalizing the separation between the words to be 1 or more of any kind of white-space:

(\b(\w+))(\s+\2)+
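
A minimal sketch for trying both patterns in Perl. The helper name first_repeat and the sample strings are made up for illustration and are not part of the answer above. Note that neither pattern ends with \b, so a match can stop partway through a longer word.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the matched text, or undef if there is no match.
sub first_repeat {
    my ($re, $line) = @_;
    return $line =~ $re ? $& : undef;
}

# The two patterns from this answer, labelled for the printout.
my %patterns = (
    'single space'   => qr/(\b(\w+))( \2)+/,
    'any whitespace' => qr/(\b(\w+))(\s+\2)+/,
);

# Sample lines (assumptions): one space, two spaces, and a longer second word.
for my $line ("the the cat", "the  the cat", "the theory") {
    for my $name (sort keys %patterns) {
        my $hit = first_repeat($patterns{$name}, $line);
        print "[$name] '$line': ", (defined $hit ? "<$hit>" : "no match"), "\n";
    }
}
```

With these samples, the single-space pattern does not match the line with two spaces, while the \s+ pattern does. Both patterns match "the the" inside "the theory"; appending \b to the pattern is one way to rule that out if whole words are required.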

Upvotes: 1

Miller

Reputation: 35198

You aren't actually checking whether it's the SAME word that repeats. To do that, you need a backreference to the captured word:

if ($line =~ m/\b(\w+)(?:\s\1){2,}\b/) {
     print "matched '$1'\n";
}

Also, anytime you're testing a regular expression, it helps to create a list of examples to work with. The following demonstrates one way of doing that using the __DATA__ block:

use strict;
use warnings;

while (my $line = <DATA>) {
    if ($line =~ m/\b(\w+)(?:\s\1){2,}\b/) {
        print "matched '$1'\n";
    } else {
        print "no match\n";
    }
}

__DATA__
foo foo
foo bar foo
foo foo foo

Outputs:

no match
no match
matched 'foo'
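
The {2,} quantifier applies to the repeats after the first occurrence, so the pattern needs the word three times in total, which is why "foo foo" prints "no match" above. If two occurrences in a row should already count (an assumption about what the question intends), a + quantifier is one option. The helper name repeated_word and the extra "foo food" sample are hypothetical:

```perl
use strict;
use warnings;

# Hypothetical helper: returns the repeated word, or undef if none is found.
# (?:\s\1)+ asks for at least one repeat after the first occurrence;
# the trailing \b keeps "foo food" from counting as a repeat of "foo".
sub repeated_word {
    my ($line) = @_;
    return $line =~ m/\b(\w+)(?:\s\1)+\b/ ? $1 : undef;
}

for my $line ("foo foo", "foo bar foo", "foo foo foo", "foo food") {
    my $word = repeated_word($line);
    print defined $word ? "matched '$word'\n" : "no match\n";
}
```

With this variant, "foo foo" and "foo foo foo" match, while "foo bar foo" and "foo food" do not.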

Upvotes: 0

CAustin

Reputation: 4614

This should be what you're looking for: (?:\b(\w+)\b) (?:\1(?: |$))+

Also, don't use \s when you're only looking for spaces, because it can also match a newline or another whitespace character. A plain space isn't a delimiter or special character in a regex (unless the /x modifier is in effect), so it's fine to just type the space. You can use [ ] if you want it to be more visually apparent.
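
A small sketch of the difference, using a made-up two-line string in which the same word ends one line and starts the next. The variable names are illustrative only:

```perl
use strict;
use warnings;

# Hypothetical sample: "world" ends the first line and starts the second.
my $text = "hello world\nworld again";

# \s also accepts the newline, so the two "world"s look adjacent.
my $with_s = $text =~ /\b(\w+)\s\1\b/ ? 1 : 0;

# [ ] accepts only a literal space, so the newline breaks the repeat.
my $with_space = $text =~ /\b(\w+)[ ]\1\b/ ? 1 : 0;

print "\\s version matched: $with_s\n";
print "literal-space version matched: $with_space\n";
```

Here the \s version matches across the line break and the literal-space version does not.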

Upvotes: 1
