Reputation: 109

Regex, find pattern only in middle of string

I am using Python 2.6 and trying to find runs of a repeated character in a string, say runs of n's, e.g. nnnnnnnABCnnnnnnnnnDEF. The number of n's in each run can vary.

If I construct a regex like this:

re.findall(r'^(((?i)n)\2{2,})', s),

I can find occurrences of case-insensitive n's only at the beginning of the string, which is fine. If I do it like this:

re.findall(r'(((?i)n)\2{2,}$)', s),

I can detect only the ones at the end of the string. But what about just in the middle?

At first, I thought of using re.findall(r'(((?i)n)\2{2,})', s) and the two previous regex(-ices?) to check the length of the returned list and the presence of n's either in the beginning or end of the string and make logical tests, but it became an ugly if-else mess very quickly.

Then I tried re.findall(r'(?!^)(((?i)n)\2{2,})', s), which seems to exclude the beginning just fine, but adding (?!$) or (?!\z) at the end of the regex only excludes the last n in ABCnnnn. Finally, I tried re.findall(r'(?!^)(((?i)n)\2{2,})\w+', s), which works sometimes but gives weird results at other times. It feels like I need a lookahead or lookbehind, but I can't wrap my head around them.

Upvotes: 5

Answers (3)

Wiktor Stribiżew

Reputation: 626689

NOTE: This solution assumes n may be a sequence of some characters. For more efficient alternatives when n is just 1 character, see other answers here.

You can use

(?<!^)(?<!n)((n)\2{2,})(?!$)(?!n)

See the regex demo

The regex will match repeated consecutive ns (ignoring case can be achieved with re.I flag) that are not at the beginning ((?<!^)) or end ((?!$)) of the string and not before ((?!n)) or after ((?<!n)) another n.

The (?<!^)(?<!n) is a sequence of two negative lookbehinds: (?<!^) fails the match if the current position is the start of the string, and (?<!n) fails the match if the current position is preceded by n. The negative lookaheads (?!$) and (?!n) work analogously: (?!$) fails the match if the end of the string follows the current position, and (?!n) fails the match if n follows the current position (that is, right after all the consecutive ns have been matched). All the lookaround conditions must be met, which is why we only get the innermost matches.

See IDEONE demo:

import re
p = re.compile(r'(?<!^)(?<!n)((n)\2{2,})(?!$)(?!n)', re.IGNORECASE)
s = "nnnnnnnABCnnnnnNnnnnDEFnNn"
print([x.group() for x in p.finditer(s)])
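
To see what the four lookarounds contribute, here is a minimal sketch that runs the bare core pattern next to the full one on the same string. The names anywhere and middle_only are only illustrative, and the sketch assumes runs of three or more n's, as in the question.

```python
import re

s = "nnnnnnnABCnnnnnNnnnnDEFnNn"

# Core pattern only: runs of three or more n's, wherever they occur.
anywhere = re.compile(r'((n)\2{2,})', re.IGNORECASE)
# Same core wrapped in the two lookbehinds and two lookaheads.
middle_only = re.compile(r'(?<!^)(?<!n)((n)\2{2,})(?!$)(?!n)', re.IGNORECASE)

all_runs = [m.group() for m in anywhere.finditer(s)]
middle_runs = [m.group() for m in middle_only.finditer(s)]

print(all_runs)     # leading, middle and trailing runs
print(middle_runs)  # only the run between ABC and DEF
```

The bare pattern reports all three runs, while the version with lookarounds keeps only the one that touches neither end of the string.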

Upvotes: 2

Kasravnd

Reputation: 107287

Instead of using a complicated regex to avoid matching the leading and trailing n characters, a more Pythonic approach is to strip() them from your string first and then find all the sequences of ns using re.findall() and a simple regex:

>>> s = "nnnABCnnnnDEFnnnnnGHInnnnnn" 
>>> import re
>>> 
>>> re.findall(r'n{2,}', s.strip('n'), re.I)
['nnnn', 'nnnnn']

Note: re.I is the ignore-case flag, which makes the regex engine match both upper-case and lower-case characters.
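
One caveat worth checking: str.strip() is case-sensitive and treats its argument as a set of characters, so strip('n') leaves a leading or trailing N in place even though the regex ignores case. A small sketch under two assumptions that are not part of the answer above: the input string is made up, and the quantifier is {3,} so that only runs of three or more count, as in the question.

```python
import re

s = "NnnABCnnnnDEFnnnnnGHInnNNnn"

# List both cases explicitly, because strip() does not know about re.I.
trimmed = s.strip('nN')

# n{3,} with re.I: three or more n/N characters in a row.
runs = re.findall(r'n{3,}', trimmed, re.I)

print(trimmed)
print(runs)
```

With strip('n') alone, the leading N would survive here and the run right after it would be reported as a match.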

Upvotes: 3

Casimir et Hippolyte

Reputation: 89547

Since "n" is a character (and not a subpattern), you can simply use:

re.findall(r'(?<=[^n])nn+(?=[^n])(?i)', s)

or better:

re.findall(r'n(?<=[^n]n)n+(?=[^n])(?i)', s)
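
Both patterns put (?i) at the end, which Python 2.6 accepts. Newer Python 3 releases reject a global inline flag that is not at the start of the pattern (an error from 3.11, a deprecation warning before that), so here is a sketch of the same two patterns with the flag passed as an argument instead. The test string is made up for illustration. Note that nn+ matches runs of two or more; n{3,} would give the three-or-more behaviour from the question.

```python
import re

s = "nnnABCnnnnDEFnnNnnGHInnnnnn"

# Lookbehind first: the run must be preceded and followed by a non-n character.
first = re.findall(r'(?<=[^n])nn+(?=[^n])', s, re.I)

# Literal n first, then a lookbehind that checks the character before it.
second = re.findall(r'n(?<=[^n]n)n+(?=[^n])', s, re.I)

print(first)
print(second)
```

With re.I the class [^n] excludes N as well, so mixed-case runs are kept whole. Requiring a real character on each side is also what rules out runs at the very start or end of the string.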

Upvotes: 2
