DaFunkyAlex

Reputation: 1969

RegEx: remove double <br /> tags

I have a dynamic string that may contain h2 tags, and some of those h2 tags contain br tags. I want to remove the br tags that sit inside the h2 tags.

<h2>Headline 1</h2>Lorem ipsum dolor sit amet, consetetur sadipscing elitr.<h2>Headline 2 <br /><br /></h2>Lorem ipsum dolor sit amet, consetetur sadipscing elitr<h2>Headline 2<br /><br /></h2>Lorem ipsum dolor sit amet, consetetur sadipscing elitr<h2>Headline 2</h2>Lorem ipsum dolor sit amet, consetetur sadipscing elitr

To remove the br tags, I use this regex:

/<h2.*?>.+?(<br[\s+]?\/>).+?<\/h2>/

The problem is that my first match is <h2>Headline 1</h2>Lorem ipsum dolor sit amet, consetetur sadipscing elitr.<h2>Headline 2 <br /><br /></h2>. Yes, works as designed :-) But how can I make the regex match only the h2 tags that actually contain a br?

Upvotes: 0

Views: 107

Answers (2)

Toto

Reputation: 91385

I suggest you use a DOM parser.
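As a rough illustration of the DOM-parser route, here is a minimal sketch. It assumes PHP's bundled dom extension is available; the function name removeBrInsideH2 and the wrapper id fragment-root are hypothetical names chosen for this example. Note that saveHTML serializes the remaining br elements as <br> rather than <br />, and that the sketch does not set an input encoding, so it is only meant for ASCII input like the sample.

```php
<?php
// Hypothetical helper: remove every <br> element located inside an <h2>.
function removeBrInsideH2(string $html): string
{
    $doc = new DOMDocument();

    // Wrap the fragment in a single <div> so it can be found again later.
    // Parser warnings are collected internally instead of being printed.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML('<div id="fragment-root">' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);

    // Select only <br> elements that have an <h2> ancestor and detach them.
    foreach (iterator_to_array($xpath->query('//h2//br')) as $br) {
        $br->parentNode->removeChild($br);
    }

    // Serialize the children of the wrapper, dropping the wrapper itself.
    $root = $xpath->query('//div[@id="fragment-root"]')->item(0);
    $out = '';
    foreach ($root->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

echo removeBrInsideH2('<h2>Headline 2 <br /><br /></h2>consetetur<br /> sadipscing');
```

Because the selection is done on the parsed tree, br tags outside the headings are left alone without any lookaround tricks.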

But if you really want to use regex, which is acceptable in this case, you can use preg_replace_callback:

$html = '<h2>Headline 1</h2>Lorem ipsum.<h2>Headline 2 <br /><br /></h2>dolor sit amet,<h2>Headline 2<br /><br /></h2>consetetur<br /> sadipscing elitr<h2>Headline 2</h2>Lorem<br /> ipsum';

# first, extract the string inside <h2>...</h2>
$res = preg_replace_callback('~<h2>\K.*?(?=</h2>)~', 
            function($m) {
                # then remove the <br />
                return  preg_replace('~<br />~', '', $m[0]);
            },
            $html);

echo $res;

Output:

<h2>Headline 1</h2>Lorem ipsum.<h2>Headline 2 </h2>dolor sit amet,<h2>Headline 2</h2>consetetur<br /> sadipscing elitr<h2>Headline 2</h2>Lorem<br /> ipsum

Upvotes: 1

virolino

Reputation: 2201

It might be much easier to do it in more than one step:

  1. Use regex to extract the <h2>...</h2> sequence
  2. Use regex to remove the <br /> tags from the <h2>...</h2> sequence
  3. Write the new string
  4. Copy everything else as-is

Alternatively, search for:

(<\s*h2[^<]*>[^<]*)<\s*br\s*\/\s*>

and replace with:

\1

Repeat until no more replacements are done.
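The "repeat until no more replacements" loop could be sketched in PHP as follows. The pattern and replacement are the ones given above; the function name stripBrFromH2 is hypothetical, and ~ is used as the delimiter purely as a matter of choice. Each pass removes at most one br per h2, so the loop relies on the count that preg_replace reports through its fifth argument.

```php
<?php
// Hypothetical helper: apply the search/replace repeatedly until nothing changes.
function stripBrFromH2(string $html): string
{
    // Group 1 captures the <h2 ...> opening tag plus the text before the first <br />.
    $pattern = '~(<\s*h2[^<]*>[^<]*)<\s*br\s*\/\s*>~';

    do {
        $result = preg_replace($pattern, '$1', $html, -1, $count);
        if ($result === null) {
            // preg_replace returns null when the regex engine reports an error.
            throw new RuntimeException('preg_replace failed');
        }
        $html = $result;
    } while ($count > 0);

    return $html;
}

echo stripBrFromH2('<h2>Headline 2 <br /><br /></h2>consetetur<br /> sadipscing');
```

Since the pattern only allows non-< characters between the opening h2 tag and the br, a br that follows another inline tag inside the heading would not be matched by this approach.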


The other, smarter solution is to use a proper HTML parser and make whatever changes you need on the parsed document.

Upvotes: 1
