MarksCode

Reputation: 8604

Match all newlines between two tags

In a string that represents HTML markup, I need to remove all newlines that occur inside any <ul>...</ul> element. Here is an example string:

<ul>\n<li>element 1\n</li>\n<li>element 2\n</li>\n</ul><p>Hello there</p>.

So all the \n inside the <ul></ul> need to be removed.

I've tried the following but it doesn't seem to be working correctly:

https://regex101.com/r/qLxSys/1

/<ul>.*?(\n)?.*?<\/ul>/

Can anybody please help me understand how I'd accomplish my goal?

Upvotes: 1

Views: 543

Answers (1)

mquantin

Reputation: 1158

To match a newline between <ul> tags, you can use the following pattern with the dot-matches-newline (s) flag enabled: (?<=<ul>).*?(\n).*(?=<\/ul>)

Group 1 only matches a single \n character inside the <ul>, so I suggest replacing iteratively: on each pass, replace the whole match with its non-newline parts (i.e. the substring between <ul> and the \n on the left, followed by the substring between the \n and </ul> on the right), and repeat until no match is left. The implementation depends on your programming language:

In Python3:

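#!python3
import re

string = "<ul>\n<li>element 1\n</li>\n<li>element 2\n</li>\n</ul>\n<p>Hello there</p>"
# re.DOTALL lets "." match newlines. The flag is passed as an argument because
# Python 3.11+ rejects inline global flags such as (?s) at the end of a pattern.
pattern = re.compile(r'(?<=<ul>)(.*?)(\n)(.*)(?=</ul>)', re.DOTALL)
# Each pass drops one newline (group 2) and keeps the text on either side of it.
while pattern.search(string):
    string = pattern.sub(r'\g<1>\g<3>', string)
print(string)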

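In the above example, the last \n is not replaced because it is not inside the <ul> element. Note that the loop assumes a single <ul> in the string: with several lists, the greedy .* spans from the first <ul> to the last </ul>, so newlines between the lists are removed as well.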
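The iterative replacement can also be done in one pass by matching each whole <ul>...</ul> block and stripping the newlines in a replacement callback. This is only a minimal sketch: the helper name strip_newlines_in_ul is made up for illustration, and it assumes literal <ul> tags without attributes and no nested lists.

```python
import re

# Matches one <ul>...</ul> block. DOTALL lets "." span newlines, and the
# lazy ".*?" stops at the first closing tag, so each list is matched separately.
# Assumption: plain "<ul>" tags (no attributes) and no nested lists.
UL_BLOCK = re.compile(r"<ul>.*?</ul>", re.DOTALL)

def strip_newlines_in_ul(html: str) -> str:
    """Remove every newline inside each <ul>...</ul> block (hypothetical helper)."""
    return UL_BLOCK.sub(lambda m: m.group(0).replace("\n", ""), html)

html = "<ul>\n<li>element 1\n</li>\n<li>element 2\n</li>\n</ul>\n<p>Hello there</p>"
result = strip_newlines_in_ul(html)
print(result)
```

Because each block is matched on its own, newlines between two separate lists are left untouched.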

A cleaner solution is to first use an HTML parser (e.g. BeautifulSoup in Python) to isolate the <ul> elements, and then remove the '\n' characters only within them.
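A rough sketch of the parser-based idea follows. To stay dependency-free it uses the standard library's html.parser instead of BeautifulSoup, so it is a swapped-in technique rather than the exact approach named above. The class and function names are hypothetical, and comments and doctype declarations are simply dropped in this simplified version.

```python
from html.parser import HTMLParser

class UlNewlineStripper(HTMLParser):
    """Re-emit the markup, dropping newlines while inside a <ul> (sketch)."""

    def __init__(self):
        # convert_charrefs=False keeps entities such as &amp; as written.
        super().__init__(convert_charrefs=False)
        self.depth = 0   # nesting level of currently open <ul> tags
        self.out = []    # pieces of the rebuilt markup

    def _emit(self, text):
        # Strip newlines only when we are inside at least one <ul>.
        self.out.append(text.replace("\n", "") if self.depth else text)

    def handle_starttag(self, tag, attrs):
        self._emit(self.get_starttag_text())
        if tag == "ul":
            self.depth += 1

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags like <br/> are emitted once, unchanged.
        self._emit(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "ul" and self.depth:
            self.depth -= 1
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self._emit(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

def strip_ul_newlines_parsed(html: str) -> str:
    """Hypothetical helper: run the parser and join the rebuilt markup."""
    parser = UlNewlineStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)

html = "<ul>\n<li>element 1\n</li>\n<li>element 2\n</li>\n</ul>\n<p>Hello there</p>"
parsed_result = strip_ul_newlines_parsed(html)
print(parsed_result)
```

Unlike a pure regex, tracking the <ul> depth also copes with nested lists and with <ul> tags that carry attributes.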

Upvotes: 1
