Reputation: 9121
I have a CMS that also lets people enter HTML, but nl2br is applied
at the end of the function, which turns this:
<ul>
<li></li>
</ul>
into this:
<ul><br/>
<li></li><br/>
</ul>
Now I want to remove the <br/> tags that sit inside the <ul> element.
I already found another question that asks almost the same thing, but for newlines. I've integrated that solution into my CMS, but for one client all the content is already filled in, so I have to fix this after the \n's have been replaced with <br/>'s.
The other question provides this regex to match \n within <ul></ul>:
/(?<=<ul>|<\/li>)\s*?(?=<\/ul>|<li>)/is
I'd think something like this:
/(?<=<ul>|<\/li>)(<br>|<br\/>|<br \/>)(?=<\/ul>|<li>)/is
would do the trick, but it doesn't. What am I missing?
EDIT
I am very open to DOMDocument solutions; if there's a way to query line breaks with XPath, that would probably fix my problem.
Upvotes: 2
Views: 1568
Reputation: 9562
In the example you provided, the <br> tags are surrounded by whitespace (at least by newline characters), so the regular expression has to account for that:
/(?<=<ul>|<\/li>)(\s*<br>\s*|\s*<br\/>\s*|\s*<br \/>\s*)(?=<\/ul>|<li>)/is
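For illustration, here is a minimal sketch of how that pattern might be applied with preg_replace(). The $html sample is an assumption that mirrors the markup from the question; the variable names are made up for the example.

```php
<?php
// Hypothetical sample: the markup from the question after nl2br() has run.
$html = "<ul><br/>\n<li></li><br/>\n</ul>";

// The pattern from this answer: a <br> variant plus any surrounding
// whitespace, but only directly after <ul> or </li> and directly
// before <li> or </ul>.
$pattern = '/(?<=<ul>|<\/li>)(\s*<br>\s*|\s*<br\/>\s*|\s*<br \/>\s*)(?=<\/ul>|<li>)/is';

// Replace every match with an empty string.
$clean = preg_replace($pattern, '', $html);

echo $clean, PHP_EOL; // prints: <ul><li></li></ul>
```

Note that the lookbehind only matches a literal <ul>, so a list opened with attributes (for example <ul class="menu">) would keep the <br/> right after its opening tag.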
In many cases regular expressions are NOT the best way to parse HTML (I definitely agree with the comments above/below), but they can be good enough for narrow, well-defined tasks like this one.
Upvotes: 2
Reputation: 28906
There are plenty of examples on SO that demonstrate why parsing HTML with regular expressions is a bad idea, so I won't include another one here.
Instead, consider using an HTML parser such as HTMLCleaner or HTML Agility Pack to accomplish this task.
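Since the question mentions DOMDocument, a parser-based approach could look roughly like the sketch below. It is not taken from either library named above; it assumes PHP's bundled DOM extension, a hypothetical $html sample, and a temporary wrapper <div> so the fragment has a single root. The XPath expression //ul/br selects only <br> elements that are direct children of a <ul>, so line breaks elsewhere (such as inside the <p>) are left alone.

```php
<?php
// Hypothetical sample: CMS output after nl2br() has been applied.
$html = "<ul><br/>\n<li>One</li><br/>\n<li>Two</li><br/>\n</ul>\n<p>Keep<br/>this</p>";

// Collect parser warnings instead of printing them.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
// Wrap the fragment in a <div> (an assumption of this sketch) and ask
// libxml not to add <html>/<body> or a doctype around it.
$doc->loadHTML('<div>' . $html . '</div>', LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Copy the node list to an array first so removal does not disturb iteration.
$removed = 0;
foreach (iterator_to_array($xpath->query('//ul/br')) as $br) {
    $br->parentNode->removeChild($br);
    $removed++;
}

// Serialise the children of the wrapper <div> back into a string.
$result = '';
foreach ($doc->documentElement->childNodes as $child) {
    $result .= $doc->saveHTML($child);
}

echo $result, PHP_EOL;
```

The serialised output will write the remaining line break as <br> rather than <br/>, and content with non-ASCII characters may need explicit encoding handling before it is passed to loadHTML().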
Upvotes: 0