Regex: how to remove redundant substring

Question

I have a string. There is redundant text at the end of this string. I want to remove all of that redundant text (both the first and second instance of the redundant text). How can I find all the repeated text at the end of a string and remove it?

In my example, I am working with a string that also has a prefix that I'm removing. So for example, I want: prefix a b c d e 123 d e 123 to return a b c

The duplicate substring can vary in length. So I would want: prefix a b c 123 c 123 to return a b

I tried matching this with

import re
re.sub(
    r'prefix ([a-z ]*)\2([a-z ]* \d*)$',
    r'\1',
    'prefix a b c 123 c 123'
)

but of course this fails with an error, since I'm referring to the contents of \2 before group 2 has been defined.
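For reference, a minimal sketch of how that failure shows up, assuming the standard library re module: the pattern is rejected when it is compiled, before any matching happens. The variable names here are only illustrative.

```python
import re

# The pattern from the question: \2 is used before group 2 is defined.
pattern = r'prefix ([a-z ]*)\2([a-z ]* \d*)$'

try:
    re.compile(pattern)
    compiled = True
except re.error as exc:
    # The re module refuses a backreference to a group that does not exist yet.
    compiled = False
    print("pattern rejected:", exc)
```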

I'm doing this in Python 3.7.

anubhava · Accepted Answer

You may use this regex for search:

^prefix\s+(.*?)(.+?)\2+$

and use r'\1' as the replacement.

Python Code:

import re

r = re.sub(
    r'^prefix\s+(.*?)(.+?)\2+$',
    r'\1',
    'prefix a b c 123 c 123'
)
print(r)  # a b

RegEx Details:

^: Start
prefix\s+: Match text prefix followed by 1+ whitespaces
(.*?): Lazily match 0 or more of any character in capture group #1
(.+?): Lazily match 1 or more of any character in capture group #2
\2+: Match 1 or more repetitions of group #2
$: End
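The breakdown above can be checked against both inputs from the question. This is a small sketch, not part of the original answer: the helper name strip_repeated_tail is hypothetical, and it assumes the literal prefix "prefix" as in the examples.

```python
import re

# Same pattern as in the answer, compiled once for reuse.
# Group 1 is the part to keep; group 2 is the unit that repeats up to the end.
PATTERN = re.compile(r'^prefix\s+(.*?)(.+?)\2+$')


def strip_repeated_tail(text):
    """Hypothetical helper: drop the prefix and the repeated trailing text.

    Input that does not match the pattern is returned unchanged,
    because re.sub only rewrites the parts it matches.
    """
    return PATTERN.sub(r'\1', text)


print(strip_repeated_tail('prefix a b c 123 c 123'))        # a b
print(strip_repeated_tail('prefix a b c d e 123 d e 123'))  # a b c
print(strip_repeated_tail('prefix a b c'))                  # prefix a b c
```

Because both groups are lazy, the engine keeps group #1 as short as it can while the rest of the string still splits into two or more copies of group #2. Note that any repeated run at the end counts, so a tail such as "11" would also be treated as a repetition of "1".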