Kartik Anand

Reputation: 4609

Python Regex matching already matched sub-string

I'm fairly new to Python Regex and I'm not able to understand the following:

I'm trying to find one small letter surrounded by three capital letters.

My first problem is that the regex below gives only one match instead of the two that are present, ['AbAD', 'DaDD']:

>>> import re
>>> 
>>> # String
... str = 'AbADaDD'
>>> 
>>> pat = '[A-Z][a-z][A-Z][A-Z]'
>>> regex = re.compile(pat)
>>> 
>>> print(regex.findall(str))
['AbAD']

I guess this is because the last D of the first match is no longer available for matching? Is there any way to turn off this behaviour?

The second issue is the following regex:

>>> import re
>>> 
>>> # String
... str = 'AbADaDD'
>>> 
>>> pat = '[^A-Z][A-Z][a-z][A-Z][A-Z][^A-Z]'
>>> regex = re.compile(pat)
>>> 
>>> print(regex.findall(str))
[]

Basically, I want no more than three capital letters surrounding a small letter, so I placed a negated character class on each side of the pattern. ['AbAD'] should be matched, but it is not. Any ideas?

Upvotes: 1

Views: 435

Answers (3)

Jaykumar Patel

Reputation: 27604

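First issue:

Use this pattern (note that it matches one capital letter on each side of the small letter, so each result is three characters long):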

 r'([A-Z]{1}[a-z]{1}[A-Z]{1})'

Example

>>> import re
>>> str = 'AbADaDD'
>>> re.findall(r'([A-Z]{1}[a-z]{1}[A-Z]{1})', str)
['AbA', 'DaD']

Second issue:

Use this pattern:

(?=(?<![A-Z])([A-Z]{1}[a-z]{1}[A-Z]{1}[A-Z]{1})(?![A-Z]))

Example

>>> import re
>>> str = 'AbADaDD'
>>> re.findall(r'(?=(?<![A-Z])([A-Z]{1}[a-z]{1}[A-Z]{1}[A-Z]{1})(?![A-Z]))', str)
['AbAD']

Upvotes: 0

vks

Reputation: 67968

The problem with your regex is that it consumes the string as it progresses, leaving nothing for the second match. Use a lookahead so that it does not consume the string.

pat = '(?=([A-Z][a-z][A-Z][A-Z]))'

For your second regex, do the same:

print(re.findall(r"(?=([A-Z][a-z][A-Z][A-Z](?=[^A-Z])))", s))  # s = 'AbADaDD'
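One edge case may be worth checking: the inner (?=[^A-Z]) needs an actual character after the match, so it cannot succeed at the very end of the string. The sketch below compares it with a negative lookahead (?![A-Z]), which does succeed there. The names POSITIVE and NEGATIVE are just for illustration, and only the first pattern comes from this answer.

```python
import re

# Pattern from this answer: the inner lookahead requires a following
# character that is not an uppercase letter.
POSITIVE = re.compile(r'(?=([A-Z][a-z][A-Z][A-Z](?=[^A-Z])))')

# Variant for comparison (an assumption, not part of the answer): only
# requires that the next character, if there is one, is not uppercase.
NEGATIVE = re.compile(r'(?=([A-Z][a-z][A-Z][A-Z](?![A-Z])))')

# With the candidate at the very end of the string, the two differ.
at_end_positive = POSITIVE.findall('AbAD')
at_end_negative = NEGATIVE.findall('AbAD')

print(at_end_positive)  # []
print(at_end_negative)  # ['AbAD']
```

Neither variant checks what comes before the match, so a lookbehind would still be needed if the preceding character matters.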

To see why the first pattern misses the second match:

1) After the first match, the string left is aDD, because the first part has already been matched.

2) aDD does not satisfy pat = '[A-Z][a-z][A-Z][A-Z]', so it does not produce a match.
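The two steps above can be reproduced with a small sketch. It assumes the string and pattern from the question; the variable names are only for illustration.

```python
import re

s = 'AbADaDD'
pattern = re.compile(r'[A-Z][a-z][A-Z][A-Z]')

# Step 1: the first match consumes the characters at indexes 0 to 3.
first = pattern.search(s)
print(first.group())   # AbAD
print(first.span())    # (0, 4)

# Step 2: scanning resumes after the consumed part, where only 'aDD' is left.
remaining = s[first.end():]
print(remaining)       # aDD

# No further match can start in the remaining text.
second = pattern.search(s, first.end())
print(second)          # None
```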

Upvotes: 0

Avinash Raj

Reputation: 174696

It's mainly because the matches overlap. Put your regex inside a lookahead in order to handle overlapping matches of this kind.

(?=([A-Z][a-z][A-Z][A-Z]))

Code:

>>> import re
>>> s = 'AbADaDD'
>>> re.findall(r'(?=([A-Z][a-z][A-Z][A-Z]))', s)
['AbAD', 'DaDD']
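To see why this finds both, it may help to look at the match positions. A minimal sketch, assuming the same string as above: the lookahead itself matches an empty string, so nothing is consumed and the second match can start at the D that the first capture ended on.

```python
import re

s = 'AbADaDD'

# Each match is zero-width; the text we want is in capture group 1.
overlaps = [(m.start(), m.end(), m.group(1))
            for m in re.finditer(r'(?=([A-Z][a-z][A-Z][A-Z]))', s)]

for start, end, text in overlaps:
    print(start, end, text)
# 0 0 AbAD
# 3 3 DaDD
```

The start and end of each match are equal, which is what lets the captured texts share the D at index 3.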


For the second one, use negative lookahead and lookbehind assertions, like below:

(?=(?<![A-Z])([A-Z][a-z][A-Z][A-Z])(?![A-Z]))

Code:

>>> re.findall(r'(?=(?<![A-Z])([A-Z][a-z][A-Z][A-Z])(?![A-Z]))', s)
['AbAD']


The problem with your second regex is that [^A-Z] has to consume a character, and there is no character before the first A for it to consume. The negative lookbehind (?<![A-Z]) performs a similar check without consuming anything: it asserts that the match is not preceded by an uppercase letter, which also holds at the start of the string. That's why your pattern gets no match.
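Here is a rough comparison of the two approaches on a few inputs. The names CONSUMING and LOOKAROUND and the extra test strings are only for illustration, and I added a capture group to the question's pattern so that both versions return the same four characters.

```python
import re

# The question's pattern, with a group around the part we want back.
CONSUMING = re.compile(r'[^A-Z]([A-Z][a-z][A-Z][A-Z])[^A-Z]')

# The lookaround version from this answer.
LOOKAROUND = re.compile(r'(?=(?<![A-Z])([A-Z][a-z][A-Z][A-Z])(?![A-Z]))')

results = {}
for text in ('AbADaDD', 'xAbADy', 'AbAD'):
    results[text] = (CONSUMING.findall(text), LOOKAROUND.findall(text))
    print(text, results[text])
# AbADaDD ([], ['AbAD'])
# xAbADy (['AbAD'], ['AbAD'])
# AbAD ([], ['AbAD'])
```

The consuming version only works when there really is a non-uppercase character on both sides, as in 'xAbADy'. The lookaround version also accepts the start and end of the string as boundaries.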

Upvotes: 1
