Reputation: 109
I am using Python 2.6 and trying to find a bunch of repeating characters in a string, let's say a bunch of n's, e.g. nnnnnnnABCnnnnnnnnnDEF. The number of n's can vary anywhere in the string.
If I construct a regex like this: re.findall(r'^(((?i)n)\2{2,})', s), I can find occurrences of case-insensitive n's only at the beginning of the string, which is fine. If I do it like this: re.findall(r'(((?i)n)\2{2,}$)', s), I can detect only the ones at the end of the sequence. But what about just in the middle?
At first, I thought of using re.findall(r'(((?i)n)\2{2,})', s) and the two previous regex(-ices?) to check the length of the returned list and the presence of n's at either the beginning or end of the string and make logical tests, but it became an ugly if-else mess very quickly.
Then, I tried re.findall(r'(?!^)(((?i)n)\2{2,})', s), which seems to exclude the beginning just fine, but (?!$) or (?!\z) at the end of the regex only excludes the last n in ABCnnnn. Finally, I tried re.findall(r'(?!^)(((?i)n)\2{2,})\w+', s), which seems to work sometimes, but I get weird results at other times. It feels like I need a lookahead or lookbehind, but I can't wrap my head around them.
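The "only excludes the last n" behaviour can be reproduced with a short sketch. It is an adaptation rather than the exact code above: it assumes a current Python 3, where a mid-pattern (?i) is no longer accepted, so the flag is passed as re.I instead.

```python
import re

s = "ABCnnnn"

# The trailing (?!$) only forbids the match from ENDING at the end of the
# string, so the greedy run gives back one character and still matches.
pattern = re.compile(r'(?!^)((n)\2{2,})(?!$)', re.I)
matches = [m.group() for m in pattern.finditer(s)]
print(matches)  # ['nnn']
```

The run is shortened by backtracking instead of being rejected, which is why a lookahead on the end of the string alone is not enough.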
Upvotes: 5
Views: 6881
Reputation: 626689
NOTE: This solution assumes n may be a sequence of several characters. For more efficient alternatives when n is just 1 character, see the other answers here.
You can use
(?<!^)(?<!n)((n)\2{2,})(?!$)(?!n)
See the regex demo
The regex will match runs of three or more consecutive ns (ignoring case can be achieved with the re.I flag) that are not at the beginning ((?<!^)) or end ((?!$)) of the string and not before ((?!n)) or after ((?<!n)) another n.
The (?<!^)(?<!n) is a sequence of 2 lookbehinds: (?<!^) fails the match if the current position is the start of the string, and the (?<!n) negative lookbehind fails the match if the current position is preceded by n. The negative lookaheads (?!$) and (?!n) have similar meanings: (?!$) fails a match if the end of the string occurs right after the current position, and (?!n) fails a match if n occurs right after the current position (that is, right after matching all consecutive ns). All of the lookaround conditions must be met, which is why we only get the innermost matches.
See IDEONE demo:
import re
p = re.compile(r'(?<!^)(?<!n)((n)\2{2,})(?!$)(?!n)', re.IGNORECASE)
s = "nnnnnnnABCnnnnnNnnnnDEFnNn"
print([x.group() for x in p.finditer(s)])
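To see what the lookarounds contribute, here is a small comparison sketch, assuming the same test string. The second pattern (named every here purely for illustration) is the same regex with all four lookarounds removed.

```python
import re

s = "nnnnnnnABCnnnnnNnnnnDEFnNn"

# Full pattern: runs of 3+ n's that touch neither end of the string.
inner = re.compile(r'(?<!^)(?<!n)((n)\2{2,})(?!$)(?!n)', re.IGNORECASE)
# Same pattern without the lookarounds, for comparison only.
every = re.compile(r'((n)\2{2,})', re.IGNORECASE)

inner_runs = [m.group() for m in inner.finditer(s)]
every_runs = [m.group() for m in every.finditer(s)]
print(inner_runs)  # ['nnnnnNnnnn']
print(every_runs)  # ['nnnnnnn', 'nnnnnNnnnn', 'nNn']
```

The leading and trailing runs show up only in the second list, so the lookarounds are what restrict the result to the inner run.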
Upvotes: 2
Reputation: 107287
Instead of using a complicated regex to avoid matching the leading and trailing n characters, a more Pythonic approach is to strip() them from your string first, then find all the remaining sequences of ns using re.findall() and a simple regex:
>>> s = "nnnABCnnnnDEFnnnnnGHInnnnnn"
>>> import re
>>>
>>> re.findall(r'n{2,}', s.strip('n'), re.I)
['nnnn', 'nnnnn']
Note: re.I is the ignore-case flag, which makes the regex engine match both upper-case and lower-case characters.
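One caveat: str.strip('n') is case-sensitive, so a leading or trailing upper-case N would survive the strip even though re.I makes the regex itself case-insensitive. A minimal sketch that strips both cases follows. The helper name inner_runs and its min_len parameter are hypothetical and only for illustration.

```python
import re

def inner_runs(s, min_len=2):
    """Return runs of n/N that touch neither end of s (hypothetical helper)."""
    # strip('nN') removes both cases; strip('n') alone would stop at an 'N'.
    trimmed = s.strip('nN')
    return re.findall(r'n{%d,}' % min_len, trimmed, re.I)

s = "NnnABCnnnnDEFnNnnnGHInnNNnn"
print(inner_runs(s))             # ['nnnn', 'nNnnn']
print(inner_runs(s, min_len=5))  # ['nNnnn']
```

Setting min_len=3 would line up with the three-or-more threshold implied by \2{2,} in the question.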
Upvotes: 3
Reputation: 89547
Since "n" is a character (and not a subpattern), you can simply use:
re.findall(r'(?<=[^n])nn+(?=[^n])(?i)', s)
or better:
re.findall(r'n(?<=[^n]n)n+(?=[^n])(?i)', s)
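Both patterns put (?i) at the end, which Python 2.6 accepts. Python 3.11 and later reject a global inline flag that is not at the start of the pattern. A sketch of the same two patterns for current Python, assuming the flag is simply moved into the flags argument:

```python
import re

s = "nnnABCnnnnDEFnNnnnGHInnnnnn"

# Same patterns as above, with re.I passed as an argument instead of a
# trailing (?i). Under re.I the class [^n] also excludes 'N'.
first = re.findall(r'(?<=[^n])nn+(?=[^n])', s, re.I)
second = re.findall(r'n(?<=[^n]n)n+(?=[^n])', s, re.I)
print(first)   # ['nnnn', 'nNnnn']
print(second)  # ['nnnn', 'nNnnn']
```

Because [^n] has to consume a real character, the lookbehind fails at the start of the string and the lookahead fails at the end, so no separate anchors are needed. Note that nn+ accepts runs of two or more; nnn+ would require three.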
Upvotes: 2