user1372240

Reputation: 53

Regular expression - back reference to match exact first match

Objective

Combine adjacent strong or emphasis elements of the same type into a single element. Take the following string:

This is a <strong>test</strong><strong>string</strong>.

What I need to do is replace the two strong tags with a single tag. The above should become:

This is a <strong>teststring</strong>.

So far I have the following regular expression that fulfils this objective:

(?<values>(\<(?<tag>emphasis|strong)\>([^\<]+)\<\/\k<tag>\>){2,}?)

Problem

Take the following test string:

This is <emphasis>a</emphasis><strong>b</strong>.

It matches the first emphasis tag to the last strong tag. However, this is not the desired behaviour. What I need is for the regular expression to match strong or emphasis and then the backreference (\k<tag>) to match on the same element (a strong or emphasis). The example above will result in a match but it should not because neither the emphasis nor strong tags are repeated.
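The behaviour described above can be reproduced with a minimal sketch. It assumes Python's re module, which writes the named group as (?P<tag>...) and the backreference as (?P=tag) instead of \k<tag>; the original engine is not stated, so this is an illustration rather than the exact setup in question. The names PATTERN, mixed, matched_text and last_tag are made up for the example.

```python
import re

# The pattern from the question, rewritten in Python syntax (an assumption:
# the original engine is not named). (?P=tag) plays the role of \k<tag>.
PATTERN = re.compile(
    r"(?P<values>(<(?P<tag>emphasis|strong)>([^<]+)</(?P=tag)>){2,}?)"
)

mixed = "This is <emphasis>a</emphasis><strong>b</strong>."
match = PATTERN.search(mixed)

# The "tag" group sits inside the repeated group, so it is captured afresh on
# every repetition. The backreference therefore only ties each closing tag to
# its own opening tag, not to the tag of the first element.
matched_text = match.group("values") if match else None
last_tag = match.group("tag") if match else None

print(matched_text)
print(last_tag)
```

In this sketch the whole emphasis-plus-strong run is matched, and the tag group ends up holding the name from the last repetition, which is consistent with the unwanted match reported above.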

One way of solving this is to first run an expression for strong only and then another for emphasis only. However, this will result in more maintenance, additional testing, etc. so is not desirable.

Thank you for any help you can provide.

Upvotes: 1

Views: 166

Answers (1)

Andy Lester

Reputation: 93745

Seems to me that what you really want to do is eliminate any closing and opening tags that are adjacent to each other.

In this:

This is a <strong>test</strong><strong>string</strong>.

You're not wanting to combine the contents of the first tag with the contents of the second tag. You just want to get rid of the </strong><strong> in the middle.

So do something like

s/<\/(\w+)><\1>//g;

If you want to limit it to certain tags, do:

s/<\/(strong|emphasis)><\1>//g;

(You didn't specify what language you're using, so I used sed-style substitutions. The g flag removes every adjacent pair, not just the first one.)
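For readers not using sed, here is a rough equivalent of the restricted substitution, sketched in Python on the assumption that its re module is available. The function name merge_adjacent and the constant ADJACENT_TAGS are hypothetical, chosen only for this example.

```python
import re

# Matches a closing tag immediately followed by an opening tag of the same
# name; \1 refers back to whichever tag name group 1 captured.
ADJACENT_TAGS = re.compile(r"</(strong|emphasis)><\1>")


def merge_adjacent(text: str) -> str:
    """Remove every '</tag><tag>' seam so neighbouring elements fuse."""
    # re.sub replaces all non-overlapping matches when no count is given.
    return ADJACENT_TAGS.sub("", text)


print(merge_adjacent("This is a <strong>test</strong><strong>string</strong>."))
print(merge_adjacent("This is <emphasis>a</emphasis><strong>b</strong>."))
```

Because the backreference is outside any repetition here, a closing emphasis tag followed by an opening strong tag is left untouched, so mixed neighbours are not merged.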

Upvotes: 1
