Reputation: 21
I'm working on a number of HTML files and I'm trying to match a <p> tag inside a <li> inside a <ul>.
For example:
<ul>
<li>1</li>
<li><p>2</p></li>
<li>
<ul>
<li><p>3</p></li>
</ul>
</li>
</ul>
My goal is to match both <p> tags (2 and 3) separately with their nearest parent <li> and <ul> tags.
Here's the regex I'm using:
/<ul>.*?(<li.*?>).*?(<p.*?>.*?<\/p>)(.*?)(<\/li>)/gs
The problem happens when I try to match HTML like this:
<ul>
<li>
<ul>
<li></li>
<p>4</p>
</ul>
</li>
</ul>
It matches the <p> tag and the further-away parent <li> and <ul> tags.
Does anyone have an idea how I can fix this?
Edit: Assume I need to use regex for this matching. I might end up using selectors in JS anyway, as you suggested, but I'd still like to know if there's an easy fix for this pattern, since this logic already exists in my app using regex.
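For reference, the behavior described above can be reproduced with a short script. This is only a sketch: the variable names are made up for illustration, and the pattern is the one from the question, unchanged.

```javascript
// Second sample from the question: <p> is a direct child of the inner <ul>.
const html = [
  "<ul>",
  "<li>",
  "<ul>",
  "<li></li>",
  "<p>4</p>",
  "</ul>",
  "</li>",
  "</ul>",
].join("\n");

// Pattern from the question.
const pattern = /<ul>.*?(<li.*?>).*?(<p.*?>.*?<\/p>)(.*?)(<\/li>)/gs;

const matches = [...html.matchAll(pattern)];
for (const m of matches) {
  // m.index is where the match starts; 0 means it began at the outer <ul>.
  console.log(m.index, JSON.stringify(m[1]), JSON.stringify(m[2]));
}
```

The single match starts at index 0, so the captured <li> belongs to the outer list: the lazy `.*?` simply skips over the inner <ul> and the empty <li></li> until it reaches the <p>.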
Upvotes: 0
Views: 790
Reputation: 4170
You have been warned in the comments against using regular expressions with HTML.
They are correct: the hierarchical structure means a linear pattern cannot always find your desired solution.
Assuming the HTML is valid anyway and there is only whitespace between the tags you are looking for, I have come up with this:
\s*(<li.*>)?\s*(<p.*>.*<\/p>)\s*(<\/li>)?
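This makes the li element optional but still captures it if it exists (at least in your examples).
The \s* parts allow for whitespace between the tags.
I replaced .*? with .*: the * already means "0 or more". Keep in mind that the ? in .*? makes the quantifier lazy rather than optional, so the two are not strictly equivalent.
You can experiment with it here: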
https://regex101.com/r/oyNweY/1
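As a rough local check, the pattern can be run over both samples from the question with `matchAll`. This is a sketch only: `sample1`, `sample2`, and `paragraphs` are hypothetical names, and the `g` flag is added so that all matches are returned.

```javascript
const sample1 = [
  "<ul>",
  "<li>1</li>",
  "<li><p>2</p></li>",
  "<li>",
  "<ul>",
  "<li><p>3</p></li>",
  "</ul>",
  "</li>",
  "</ul>",
].join("\n");

const sample2 = [
  "<ul>",
  "<li>",
  "<ul>",
  "<li></li>",
  "<p>4</p>",
  "</ul>",
  "</li>",
  "</ul>",
].join("\n");

// Pattern from this answer; there is no `s` flag, so `.` stops at line breaks.
const pattern = /\s*(<li.*>)?\s*(<p.*>.*<\/p>)\s*(<\/li>)?/g;

// Collect the second capture group (the <p> element) of every match.
const paragraphs = (html) => [...html.matchAll(pattern)].map((m) => m[2]);

console.log(paragraphs(sample1));
console.log(paragraphs(sample2));

// Group 1 on the second sample is the preceding sibling, not a parent.
const firstLi = [...sample2.matchAll(pattern)][0][1];
console.log(firstLi);
```

Note that on the second sample the optional first group picks up the empty `<li></li>` that precedes the <p>, which is a sibling rather than a parent, so the groups still need to be checked before use.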
Upvotes: 0
Reputation: 1263
Is your goal to find or fix bad HTML? A <p> as a direct descendant of <ul> is not allowed. If that is why you are reaching for regex, a better approach would likely be a simple parser.
If not, the simplest would be something like document.createElement + innerHTML + querySelectorAll.
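A minimal sketch of that DOM-based approach, assuming a browser environment. The function name `findParagraphsInLists` and its `doc` parameter are hypothetical, and the selector assumes the <p> must be a direct child of an <li> that is itself a direct child of a <ul>.

```javascript
// Browser-oriented sketch: relies on the DOM (createElement, innerHTML,
// querySelectorAll). `doc` defaults to the global document when one exists.
function findParagraphsInLists(html, doc = globalThis.document) {
  const container = doc.createElement("div");
  // Let the HTML parser build the tree instead of matching text.
  container.innerHTML = html;
  // "ul > li > p" only matches a <p> whose parent is an <li> inside a <ul>.
  return Array.from(container.querySelectorAll("ul > li > p"), (p) => ({
    p,
    li: p.parentElement,
    ul: p.parentElement.parentElement,
  }));
}

// Usage in a browser console (not executed here):
// findParagraphsInLists("<ul><li><p>2</p></li></ul>")
//   .forEach(({ p, li, ul }) => console.log(p.outerHTML, li, ul));
```

Because the parser builds the real tree, a <p> placed directly under a <ul> has the <ul> as its parent and is therefore not selected by `ul > li > p`.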
If using RegExp, use negated < and > character classes as the "delimiter" when matching tags, i.e.:
<foo[^>]*>
// and
[^<]*
Though obviously not fool-proof. Quick and dirty for your case:
/<ul>[^<]*<li[^>]*>[^<]*<p[^>]*>([^<]*)/
     |       |     |
     |       |     +-- ...
     |       +-- not >
     +-- not <
It would fail with tags inside <p> (i.e. it depends on there being text only inside <p> ... </p>).
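To see what the quick-and-dirty pattern does on the two samples from the question, here is a small sketch. The names are hypothetical, the `g` flag is added to collect all matches, and `withoutUl` is a variant of my own (not part of the pattern above) that drops the `<ul>` prefix.

```javascript
const sample1 = [
  "<ul>",
  "<li>1</li>",
  "<li><p>2</p></li>",
  "<li>",
  "<ul>",
  "<li><p>3</p></li>",
  "</ul>",
  "</li>",
  "</ul>",
].join("\n");

const sample2 = [
  "<ul>",
  "<li>",
  "<ul>",
  "<li></li>",
  "<p>4</p>",
  "</ul>",
  "</li>",
  "</ul>",
].join("\n");

// Pattern from the answer: only matches a <p> in the first <li> after a <ul>.
const quickAndDirty = /<ul>[^<]*<li[^>]*>[^<]*<p[^>]*>([^<]*)/g;

// Assumed variant: without the <ul> prefix, any <li> directly followed by <p>.
const withoutUl = /<li[^>]*>[^<]*<p[^>]*>([^<]*)/g;

// Collect the captured text content of each matched <p>.
const texts = (re, html) => [...html.matchAll(re)].map((m) => m[1]);

console.log(texts(quickAndDirty, sample1)); // [ '3' ]
console.log(texts(withoutUl, sample1));     // [ '2', '3' ]
console.log(texts(quickAndDirty, sample2)); // []
console.log(texts(withoutUl, sample2));     // []
```

Since `[^<]*` cannot cross another tag, neither pattern reaches the stray <p>4</p>. The original pattern misses "2" because that <li> is not the first one after its <ul>.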
Upvotes: 1
Reputation: 1542
This is a partial answer.
The best I got so far is /<ul>.*?(<li.*?>(?:(?!<li>).)*?<p.*?>.*?<\/p>(?:(?!<\/li>).)*<\/li>)/gs
With
<ul>
<li>1</li>
<li><p>2</p></li>
<li>
<ul>
<li><p>3</p></li>
</ul>
</li>
</ul>
it gives (first one is obviously wrong)
<li>1</li> <li><p>2</p></li>
and <li><p>3</p></li>
With
<ul>
<li>
<ul>
<li></li>
<p>4</p>
</ul>
</li>
</ul>
the result is
<li>
<ul>
<li></li>
<p>4</p>
</ul>
</li>
Maybe someone can improve it further.
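One possible direction, offered only as a sketch and not as a complete fix: apply the same negative-lookahead trick to both opening and closing <li>/<ul> tags, so that nothing between the <li> and the <p> may start or end a list element. The pattern and names below are my own assumptions; the pattern anchors on the nearest <li> only and does not capture the <ul>.

```javascript
const sample1 = [
  "<ul>",
  "<li>1</li>",
  "<li><p>2</p></li>",
  "<li>",
  "<ul>",
  "<li><p>3</p></li>",
  "</ul>",
  "</li>",
  "</ul>",
].join("\n");

const sample2 = [
  "<ul>",
  "<li>",
  "<ul>",
  "<li></li>",
  "<p>4</p>",
  "</ul>",
  "</li>",
  "</ul>",
].join("\n");

// Group 1: the opening <li>. Group 2: anything before the <p> that does not
// open or close an <li>/<ul>. Group 3: the <p> element (up to the first </p>).
const nearestLi =
  /(<li\b[^>]*>)((?:(?!<\/?(?:li|ul)\b)[\s\S])*?)(<p\b[^>]*>[\s\S]*?<\/p>)/g;

// Collect the matched <p> elements.
const paragraphs = (html) => [...html.matchAll(nearestLi)].map((m) => m[3]);

console.log(paragraphs(sample1)); // [ '<p>2</p>', '<p>3</p>' ]
console.log(paragraphs(sample2)); // []
```

It still assumes simple markup (no comments, no `>` inside attribute values, no nested <p>), so the usual caveats about regex and HTML apply.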
Upvotes: 0