nalzok

Reputation: 16107

Matching same characters in a row using regex

I want to match "three uppercase letters, one lowercase letter, and three uppercase letters" using a regular expression. What makes this difficult is that the adjacent uppercase letters must be the same. For example, I expect AAAbCCC to match, but not AAAbCCD or ABAbCDC.

Here is what I've tried:

print(re.findall("[A-Z]{3}[a-z][A-Z]{3}", l))

However, this is not what I want, because it matches AAAbCCD and ABAbCDC as well.

Upvotes: 2

Views: 3013

Answers (3)

Eugene Yarmash

Reputation: 149736

You can use capture groups and backreferences:

re.findall(r"(([A-Z])\2\2[a-z]([A-Z])\3\3)", string)

Note, however, that when the pattern contains groups, re.findall() returns the groups instead of the full matches. So to get the matched strings you'll need to enclose the whole pattern in parentheses and take the first group:

>>> s = "AAAbCCC AAAbCCD"
>>> [groups[0] for groups in re.findall(r"(([A-Z])\2\2[a-z]([A-Z])\3\3)", s)]
['AAAbCCC']

You can also use re.finditer(), which returns an iterator over the match objects:

>>> [match.group(1) for match in re.finditer(r"(([A-Z])\2\2[a-z]([A-Z])\3\3)", s)]
['AAAbCCC']
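To see why the outer group is needed, here is a small sketch of what re.findall() returns when the pattern has only the two inner groups. The variable names are illustrative only. Using match.group(0) with re.finditer() is another way to get the whole match without adding an outer group.

```python
import re

s = "AAAbCCC AAAbCCD"

# With two groups and no outer group, findall() returns one tuple of
# captured groups per match, not the matched text itself.
groups_only = re.findall(r"([A-Z])\1\1[a-z]([A-Z])\2\2", s)
print(groups_only)

# group(0) is always the entire match, so no extra parentheses are needed.
whole_matches = [m.group(0) for m in re.finditer(r"([A-Z])\1\1[a-z]([A-Z])\2\2", s)]
print(whole_matches)
```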

Upvotes: 2

Tryph

Reputation: 6209

You can use ([A-Z])\1{2}[a-z]([A-Z])\2{2}.

It stores the first uppercase character it finds in a group and reuses it with \1 (and \2 for the second run) to check that the two following characters are the same.
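As a minimal sketch of how this pattern could be used from Python (the helper name find_runs and the sample text are assumptions for illustration, not part of the answer):

```python
import re

# Each group captures one uppercase letter; \1{2} and \2{2} then require
# that same letter to appear exactly twice more.
PATTERN = re.compile(r"([A-Z])\1{2}[a-z]([A-Z])\2{2}")

def find_runs(text):
    """Return the full text of every match of PATTERN in text."""
    return [m.group(0) for m in PATTERN.finditer(text)]

print(find_runs("AAAbCCC AAAbCCD ABAbCDC XXXyZZZ"))
```

Because the pattern is not anchored, it also finds matches embedded in longer strings.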

Upvotes: 3

heemayl

Reputation: 41987

Use capture groups with backreferences:

^([A-Z])\1\1[a-z]([A-Z])\2\2$

  • ^([A-Z]) captures the first uppercase letter into group 1; \1\1 then matches the next two characters only if they are the same as the captured one. The same applies to the second captured letter, which is referenced by \2.

You can also use a repetition quantifier, {n}:

^([A-Z])\1{2}[a-z]([A-Z])\2{2}$
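The anchored form is suited to validating a whole string rather than searching inside a longer one. A rough sketch, assuming a hypothetical helper is_valid and candidates that are already split into separate strings:

```python
import re

# ^ and $ anchor the pattern, so the candidate must consist of nothing
# but the seven-character sequence.
ANCHORED = re.compile(r"^([A-Z])\1{2}[a-z]([A-Z])\2{2}$")

def is_valid(candidate):
    """Return True if the whole candidate has the form XXXyZZZ."""
    return ANCHORED.match(candidate) is not None

for word in ["AAAbCCC", "AAAbCCD", "ABAbCDC", "xAAAbCCC"]:
    print(word, is_valid(word))
```

In Python, re.fullmatch() with the unanchored pattern is an alternative way to require that the entire string matches.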

Upvotes: 3
