Reputation: 8604
In a string that represents HTML markup, I need to remove all newlines that occur between any <ul> and </ul> tags. Here is an example string:
<ul>\n<li>element 1\n</li>\n<li>element 2\n</li>\n</ul><p>Hello there</p>
So all the \n inside the <ul></ul> need to be removed.
I've tried the following but it doesn't seem to be working correctly:
https://regex101.com/r/qLxSys/1
/<ul>.*?(\n)?.*?<\/ul>/
Can anybody please help me understand how I'd accomplish my goal?
Upvotes: 1
Views: 543
Reputation: 1158
To match a newline between <ul> tags, you can use the following pattern, with the s (dot-all) flag set so that . also matches newlines:
(?<=<ul>).*?(\n).*(?=<\/ul>)
Group 1 matches only one \n character inside the <ul> element.
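To see why a single match is not enough, here is a small sketch that runs the pattern once over a shortened sample string. The variable names are made up for illustration, and re.DOTALL is assumed to stand in for the s flag:

```python
import re

# Shortened sample with three newlines inside the list.
sample = "<ul>\n<li>element 1\n</li>\n</ul><p>Hello there</p>"

# Same pattern as above; DOTALL makes "." match newlines too.
pattern = re.compile(r'(?<=<ul>).*?(\n).*(?=</ul>)', re.DOTALL)

matches = list(pattern.finditer(sample))
print(sample.count("\n"))   # 3 newlines in the input
print(len(matches))         # 1 match in total
print(matches[0].span(1))   # (4, 5): only the first newline is captured
```

The lookbehind pins the match to the position right after <ul>, so one pass can capture only the first newline. Removing the others needs repeated substitution.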
So I suggest replacing iteratively: for each \n matched, replace the whole match with its non-matching substrings, i.e. the text between <ul> and the \n on the left, followed by the text between the \n and </ul> on the right. The implementation depends on your programming language.
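In Python 3:
#!python3
import re

string = "<ul>\n<li>element 1\n</li>\n<li>element 2\n</li>\n</ul>\n<p>Hello there</p>"
# DOTALL lets "." match newlines. The flag is passed separately because
# inline global flags such as (?s) must appear at the start of the pattern.
pattern = re.compile(r'(?<=<ul>)(.*?)(\n)(.*)(?=</ul>)', re.DOTALL)
# Each pass removes one newline: keep groups 1 and 3, drop group 2.
while pattern.search(string):
    string = pattern.sub(r'\g<1>\g<3>', string)
print(string)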
In the above example, the last \n is not replaced because it is not between <ul> tags.
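If the input can contain more than one list, a variant worth considering is to match each <ul>...</ul> block lazily and strip its newlines in a replacement callback. This is only a sketch, not the approach shown above. The name strip_ul_newlines is hypothetical, and it assumes <ul> tags without attributes and no nested lists:

```python
import re

# Lazily match one <ul>...</ul> block at a time; DOTALL lets "." span newlines.
UL_BLOCK = re.compile(r'<ul>.*?</ul>', re.DOTALL)


def strip_ul_newlines(html: str) -> str:
    """Remove newlines inside each <ul>...</ul> block (hypothetical helper)."""
    return UL_BLOCK.sub(lambda m: m.group(0).replace('\n', ''), html)


html = "<ul>\n<li>a\n</li>\n</ul>\n<p>x</p>\n<ul>\n<li>b</li>\n</ul>"
print(repr(strip_ul_newlines(html)))
# '<ul><li>a</li></ul>\n<p>x</p>\n<ul><li>b</li></ul>'
```

Because each match stops at the nearest </ul>, newlines between two separate lists are left untouched, and a single pass is enough.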
Another, cleaner solution is to first use an HTML parser (e.g. BeautifulSoup in Python) to extract only the <ul> elements, and then use a regex to match the '\n' characters within them.
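A rough sketch of the parser-based idea follows. To avoid a third-party dependency it uses the standard library's html.parser instead of BeautifulSoup, and a plain str.replace instead of a regex, so it is a swapped-in variant rather than the exact approach named above. The class and function names are hypothetical. It re-emits the markup as it parses and drops newlines only from text inside <ul> elements (newlines inside a tag's attributes are left alone, and end tags are written back in lowercase):

```python
from html.parser import HTMLParser


class UlNewlineStripper(HTMLParser):
    """Re-emit the markup, dropping newlines from text inside <ul> elements."""

    def __init__(self):
        # convert_charrefs=False keeps entities such as &amp; as written.
        super().__init__(convert_charrefs=False)
        self.out = []
        self.ul_depth = 0  # > 0 while inside at least one <ul>

    def handle_starttag(self, tag, attrs):
        if tag == "ul":
            self.ul_depth += 1
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags are copied through unchanged.
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "ul" and self.ul_depth > 0:
            self.ul_depth -= 1
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data.replace("\n", "") if self.ul_depth else data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def strip_ul_newlines_with_parser(html: str) -> str:
    """Hypothetical helper: run the stripper over a whole document."""
    parser = UlNewlineStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


doc = "<ul>\n<li>element 1\n</li>\n<li>element 2\n</li>\n</ul>\n<p>Hello there</p>"
print(repr(strip_ul_newlines_with_parser(doc)))
# '<ul><li>element 1</li><li>element 2</li></ul>\n<p>Hello there</p>'
```

Tracking a depth counter rather than a boolean means nested lists and <ul> tags with attributes are handled, which the pure-regex versions struggle with.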
Upvotes: 1