Jess

Reputation: 53

Using python regex with backreference matches

I have a question about regex backreferences.

I need to find repeated characters in a string. I tried the regex (\w)\1{1,}, but it only matches characters that are repeated consecutively. I'm stuck on how to extend it so that it captures every repeated character, including non-adjacent ones. Some examples:

import re

str = 'capitals'

re.search(r'(\w)\1{1,}', str)

Output: None

import re

str = 'butterfly'

re.search(r'(\w)\1{1,}', str)

<_sre.SRE_Match object; span=(2, 4), match='tt'>

Upvotes: 3

Views: 13222

Answers (2)

Daniel Muthupandi

Reputation: 69

I hope the code below helps you understand backreferences in Python regex.

The sample string str contains two kinds of records:

  1. Employee basic info:

    • starts with @employeename and ends with employeename
    • e.g. @daniel dxc chennai 45000 male daniel
  2. Employee designation:

    • starts with %employeename, followed by the designation, and ends with employeename%
    • e.g. %daniel python developer daniel%

import re

#sample input

str="""
@daniel dxc chennai 45000 male daniel @henry infosys bengaluru 29000 male hobby-swimming henry
@raja zoho chennai 37000 male raja @ramu infosys bengaluru 99000 male hobby-badminton ramu
%daniel python developer daniel% %henry database admin henry%
%raja Testing lead raja% %ramu Manager ramu%
"""

#backreferencing employee name (\w+)  <----  \1
#----------------------------------------------
basic_info=re.findall(r'@+(\w+)(.*?)\1',str)
print(basic_info)

#(%) <-- \1  and (\w+) <--- \2
#-------------------------------
designation=re.findall(r'(%)+(\w+)(.*?)\2\1',str)
print(designation)

#drop the captured '%' (group 1), keeping only (name, designation)
designation=[(name,role) for _,name,role in designation]
print(designation)
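
The same idea can also be written with a named group, which some readers find easier to follow than numbered backreferences. This is only a sketch: the single-record string `record` and the group names `name` and `details` are illustrative choices, not part of the code above.

```python
import re

# Hypothetical single record, shaped like the basic-info entries above.
record = "@daniel dxc chennai 45000 male daniel"

# (?P<name>...) names the group; (?P=name) is a backreference to it,
# equivalent to \1 in the numbered form.
pattern = re.compile(r"@(?P<name>\w+)(?P<details>.*?)(?P=name)")

match = pattern.search(record)
if match:
    print(match.group("name"))             # daniel
    print(match.group("details").strip())  # dxc chennai 45000 male
```

Note that `.` does not match a newline by default, so a record has to sit on a single line unless `re.DOTALL` is passed.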

Upvotes: 2

Henry

Reputation: 151

I would use r'(\w).*\1' so that a repeated character is matched even when other characters, including spaces or special characters, appear between the two occurrences.

However, this won't work when the repeats overlap, as in the string abcdabcd. Only the first repeated character (a) is reported, because the match from the first a to the second a consumes the b, c and d in between, so their repeats are never examined.

Check the demo: https://regex101.com/r/m5UfAe/1
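
One possible way around the overlap problem, not taken from this answer, is to move the backreference into a lookahead so that the match consumes only one character at a time. The sketch below assumes that approach; the helper name `repeated_chars_lookahead` is made up for illustration.

```python
import re

def repeated_chars_lookahead(text):
    """Return each word character that occurs more than once, in first-seen order."""
    # (\w)(?=.*\1): match a character only if the same character appears
    # again later. The lookahead consumes nothing, so scanning resumes at
    # the very next character and overlapping repeats are not skipped.
    hits = re.findall(r'(\w)(?=.*\1)', text)
    # A character that occurs three or more times is reported more than
    # once; dict.fromkeys drops the duplicates while preserving order.
    return list(dict.fromkeys(hits))

print(repeated_chars_lookahead('abcdabcd'))   # ['a', 'b', 'c', 'd']
print(repeated_chars_lookahead('capitals'))   # ['a']
print(repeated_chars_lookahead('butterfly'))  # ['t']
```

As with the original pattern, `.` stops at newlines unless `re.DOTALL` is passed, so multi-line input may need that flag.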

So an alternative (depending on your needs) is to sort the string before matching:

import re
str = 'abcdabcde'
re.findall(r'(\w).*\1', ''.join(sorted(str)))

which returns the list of repeated characters: ['a', 'b', 'c', 'd']
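
Applied to the two strings from the question, the sorting approach could look like the sketch below. The wrapper name `repeated_chars_sorted` is a hypothetical helper introduced here for illustration.

```python
import re

def repeated_chars_sorted(text):
    # Sorting places identical characters next to each other, so every
    # repeated character forms one contiguous run that (\w).*\1 can match
    # without swallowing other repeated characters.
    return re.findall(r'(\w).*\1', ''.join(sorted(text)))

print(repeated_chars_sorted('capitals'))   # ['a']
print(repeated_chars_sorted('butterfly'))  # ['t']
```

The trade-off is that sorting discards the original positions, so this only answers which characters repeat, not where.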

Upvotes: 6
