Brian G

Reputation: 55022

How can I find repeated letters with a Perl regex?

I am looking for a regex that will find repeating letters. So any letter twice or more, for example:

booooooot or abbott

I won't know the letter I am looking for ahead of time.

This is a question I was asked in an interview, and I have since asked it in interviews myself. Not many people get it right.

Upvotes: 24

Views: 17237

Answers (11)

dland

Reputation: 4419

You might want to take care as to what is considered to be a letter, and this depends on your locale. Using ISO Latin-1 will allow accented Western language characters to be matched as letters. In the following program, the default locale doesn't recognise é, and thus créé fails to match. Uncomment the locale setting code, and then it begins to match.

Also note that \w includes digits and the underscore character along with all the letters. To match only letters, negate a class that contains \W (every non-word character), the digits and the underscore: [^\W_0-9]. What is left over is letters only.

That might be easier to understand by framing it as the question:

"What regular expression matches any digit except 3?"
The answer is:
/[^\D3]/

#! /usr/local/bin/perl

use strict;
use warnings;

# uncomment the following three lines:
# use locale;
# use POSIX;
# setlocale(LC_CTYPE, 'fr_FR.ISO8859-1');

while (<DATA>) {
    chomp;
    if (/([^\W_0-9])\1+/) {
        print "$_: dup [$1]\n";
    }
    else {
        print "$_: nope\n";
    }
}

__DATA__
100
food
créé
a::b
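
If the input is already decoded Unicode text, the Unicode property \p{L} ("any letter") may be a simpler route than switching locales. The following is only a sketch under that assumption; first_repeated_letter is a hypothetical helper name, not something from the program above, and the file is assumed to be saved as UTF-8.

```perl
use strict;
use warnings;
use utf8;    # assumption: this source file is saved as UTF-8

# Hypothetical helper: returns the first letter that appears twice or more
# in a row, or undef when no letter is repeated.
sub first_repeated_letter {
    my ($text) = @_;
    return $text =~ /(\p{L})\1+/ ? $1 : undef;
}

binmode STDOUT, ':encoding(UTF-8)';
for my $word ('100', 'food', 'créé', 'a::b') {
    my $hit = first_repeated_letter($word);
    print "$word: ", defined $hit ? "dup [$hit]" : 'nope', "\n";
}
```

With this class, digits and punctuation never count as a repeat, while accented letters do, without touching the locale.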

Upvotes: 4

Sankar

Reputation:

The following code returns every character that repeats two or more times in a row:

my $str = "SSSannnkaaarsss";

print $str =~ /(\w)\1+/g;
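
The one-liner relies on the fact that a /g match in list context returns every captured $1. A small sketch that makes this explicit, with repeated_chars as a hypothetical helper name:

```perl
use strict;
use warnings;

# Hypothetical helper: returns one entry per run, holding the character
# that repeats in that run.
sub repeated_chars {
    my ($text) = @_;
    # List context plus /g: the match returns every value captured by group 1.
    my @chars = $text =~ /(\w)\1+/g;
    return @chars;
}

my @found = repeated_chars('SSSannnkaaarsss');
print "@found\n";
```

Each run contributes its character once, so a string with four runs yields four entries, and a string with no repeats yields an empty list.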

Upvotes: 3

Abdullah

Reputation: 998

/(.)\1+/u

The u modifier makes the pattern use Unicode matching rules. Note that . matches any character, not only letters.

Upvotes: 0

karakays

Reputation: 3673

I think this should also work:

((\w)(?=\2))+\2

Upvotes: 0

Joseph Pecoraro

Reputation: 2876

How about:

(\w)\1+

The first part makes an unnamed group around a character, then the back-reference looks for that same character.

Upvotes: 0

hasseg

Reputation: 6807

I think using a backreference would work:

(\w)\1+

\w is basically [a-zA-Z_0-9] so if you only want to match letters between A and Z (case insensitively), use [a-zA-Z] instead.

(EDIT: or, like Tanktalus mentioned in his comment (and as others have answered as well), [[:alpha:]], which is locale-sensitive)
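
To see how the choice of character class changes the result, here is a rough comparison. has_repeat is a hypothetical helper, and the sample strings are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: true if $text contains a character of the given
# class immediately followed by one or more copies of itself.
sub has_repeat {
    my ($text, $class) = @_;
    # $class is a qr// object; the outer parentheses are capture group 1.
    return $text =~ /($class)\1+/ ? 1 : 0;
}

my @classes = (
    [ '\w'          => qr/\w/ ],
    [ '[a-zA-Z]'    => qr/[a-zA-Z]/ ],
    [ '[[:alpha:]]' => qr/[[:alpha:]]/ ],
);

for my $text ('abbott', '1001', 'snake__case') {
    for my $entry (@classes) {
        my ($label, $class) = @$entry;
        printf "%-12s %-12s %s\n", $text, $label,
            has_repeat($text, $class) ? 'match' : 'no match';
    }
}
```

Only \w should report a match for the doubled digits and the doubled underscore; the two letter-only classes should match just the doubled letters.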

Upvotes: 9

ysth

Reputation: 98398

Just for kicks, a completely different approach:

if ( ($str ^ substr($str,1) ) =~ /\0+/ ) {
    print "found ", substr($str, $-[0], $+[0]-$-[0]+1), " at offset ", $-[0];
}
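
A sketch of why this works, as far as I can tell: XOR-ing the string with a copy of itself shifted left by one character produces a NUL byte wherever two neighbours are equal, so a run of n NULs marks n+1 identical characters. first_run_by_xor is a hypothetical name. The sketch assumes the string bitwise operator (so no "use feature 'bitwise'") and input that contains no NUL bytes of its own. Unlike the regex answers, it reports any repeated character, not only letters.

```perl
use strict;
use warnings;

# Hypothetical helper: returns (run, offset) for the first run of identical
# adjacent characters, or an empty list when there is none.
sub first_run_by_xor {
    my ($str) = @_;
    return if length($str) < 2;

    # "$str" forces a string operand, so ^ works character by character.
    # Equal neighbours XOR to "\0"; the shorter operand is zero-padded.
    my $diff = "$str" ^ substr($str, 1);
    return unless $diff =~ /\0+/;

    my ($start, $nuls) = ($-[0], $+[0] - $-[0]);
    return (substr($str, $start, $nuls + 1), $start);
}

my ($run, $offset) = first_run_by_xor('abbott');
print "found $run at offset $offset\n" if defined $run;
```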

Upvotes: 2

b w

Reputation: 4663

FYI, aside from RegexBuddy, a real handy free site for testing regular expressions is RegExr at gskinner.com. It handles ([[:alpha:]])(\1+) nicely.

Upvotes: 1

Keng

Reputation: 53101

I think you actually want this rather than the "\w" as that includes numbers and the underscore.

([a-zA-Z])\1+

Ok, ok, I can take a hint Leon. Use this for the unicode-world or for posix stuff.

([[:alpha:]])\1+

Upvotes: 14

Adam Bellaire

Reputation: 110489

You can find any letter, then use \1 to find that same letter a second time (or more). If you only need to know the letter, then $1 will contain it. Otherwise you can concatenate the second match onto the first.

my $str = "Foooooobar";

$str =~ /(\w)(\1+)/;

print $1;
# prints 'o'
print $1 . $2;
# prints 'oooooo'
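
If the string can contain more than one run, the same idea extends to a while loop with /g. This is only a sketch; all_runs is a hypothetical helper, and it wraps the whole run in an extra group so that no concatenation is needed.

```perl
use strict;
use warnings;

# Hypothetical helper: returns one [run, offset] pair for every run of a
# repeated word character in $text.
sub all_runs {
    my ($text) = @_;
    my @runs;
    # Group 1 is the whole run, group 2 the single character being repeated.
    while ($text =~ /((\w)\2+)/g) {
        push @runs, [ $1, $-[1] ];    # $-[1] is where group 1 starts
    }
    return @runs;
}

for my $run (all_runs('abbott')) {
    printf "%s at offset %d\n", @$run;
}
```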

Upvotes: 54

Jonathan Lonowski

Reputation: 123463

Use a backreference \N, where N is the number of a capture group (\1, \2 and so on), to refer back to an earlier group:

/(\w)\1+/g

Upvotes: 6
