u_ser__

Reputation: 45

Remove Repeated Text

Can someone modify this regex to remove repeated tags, as in the example below?

My current pattern does not work when the input has extra content in it, as below: (<.+?\/>)(?=\1)

<text><text>extra<words><text><words><something>

Should turn into:

<text>extra<words><something>

Thanks

Upvotes: 0

Views: 101

Answers (1)

p.s.w.g

Reputation: 149020

This is what I've come up with using lookbehinds and back references:

(<[^>]+>)(?<=\1.*\1)

This will match any instance of <tag> which is preceded by at least one other instance of the same <tag>.

For example, to use this in C#:

var input = "<text><text>extra<words><text><words><something>";
var output = Regex.Replace(input, @"(<[^>]+>)(?<=\1.*\1)", "");
Console.WriteLine(output); // <text>extra<words><something>

However, this will not work in many flavors of regex, because it relies on a variable-length lookbehind. JavaScript engines that predate ES2018, for example, do not support lookbehinds at all.
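
If lookbehinds are not an option, one alternative is to drop the lookbehind and track the tags already seen in code instead. The following is only a rough sketch of that idea, not part of the pattern above: RemoveRepeatedTags is a hypothetical helper name, and it assumes C# 9 or later (top-level statements) and that a "tag" is anything matching <[^>]+>.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper (name chosen for illustration): keeps the first
// occurrence of each <tag> and removes every later occurrence.
static string RemoveRepeatedTags(string input)
{
    var seen = new HashSet<string>();
    // HashSet.Add returns false if the tag was already in the set,
    // in which case the match is replaced with an empty string.
    return Regex.Replace(input, @"<[^>]+>", m => seen.Add(m.Value) ? m.Value : "");
}

var input = "<text><text>extra<words><text><words><something>";
Console.WriteLine(RemoveRepeatedTags(input)); // <text>extra<words><something>
```

The pattern here is a plain tag match with no lookaround, so the same approach should carry over to other languages that accept a callback as the replacement.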

Upvotes: 1
