Reputation: 63
I have a string in .net.
<p class='p1'>Para 1</p><p>Para 2</p><p class="p2">Para 3</p><p>Para 4</p>
Now I want to get only the text inside the p tags (Para 1, Para 2, Para 3, Para 4).
I used the following regular expression, but it doesn't give me the expected result.
(?<=<p.*>).*?(?=</p>)
If I use (?<=<p>).*?(?=</p>)
it gives only Para 2 and Para 4, the two p tags that don't have a class attribute.
I'd like to know what's wrong with the first pattern, (?<=<p.*>).*?(?=</p>)
Upvotes: 2
Views: 139
Reputation: 336158
Let's illustrate this using RegexBuddy:
Your regex matches more than you think - the dot matches any character, so it doesn't care about tag boundaries.
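To see this concretely, here is a minimal C# sketch that runs the original pattern against the sample string from the question. The variable names (`html`, `greedyMatches`) are illustrative only. Under .NET's regex semantics, only the first match should come out clean; the later ones should begin with the opening tag, because the `.*` in the lookbehind is satisfied by the very first `<p` in the string.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Sample input from the question.
string html = "<p class='p1'>Para 1</p><p>Para 2</p><p class=\"p2\">Para 3</p><p>Para 4</p>";

// The original pattern: the ".*" inside the lookbehind can span earlier tags,
// so any position right after a ">" qualifies once a "<p" has been seen.
string[] greedyMatches = Regex.Matches(html, @"(?<=<p.*>).*?(?=</p>)")
    .Cast<Match>()
    .Select(m => m.Value)
    .ToArray();

// Print each match on its own line to inspect where it starts.
foreach (string value in greedyMatches)
    Console.WriteLine(value);
```

Matches two to four start right after the preceding `</p>`, which is why they include the next opening tag.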
What it is actually doing:
(?<=<p.*>): Assert that there is <p (followed by any number of characters) anywhere in the string before the current position, followed by a >.
.*?: Match any number of characters...
(?=</p>): ...until the next occurrence of </p>.
Your question is a bit unclear, but if your plan is to find text within <p> tags regardless of whether they contain any attributes, you shouldn't be using regular expressions anyway but a DOM parser, for example the HTML Agility Pack.
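As a rough sketch of the parser route, assuming the HtmlAgilityPack NuGet package is referenced in the project, something like the following would collect the text of every p element. The variable names are illustrative, and `InnerText` is returned as-is (HTML entities are not decoded here).

```csharp
using System;
using System.Linq;
using HtmlAgilityPack; // third-party NuGet package, assumed to be installed

// Sample input from the question.
string html = "<p class='p1'>Para 1</p><p>Para 2</p><p class=\"p2\">Para 3</p><p>Para 4</p>";

// Parse the fragment into a DOM instead of pattern-matching on raw text.
var doc = new HtmlDocument();
doc.LoadHtml(html);

// Descendants("p") enumerates every <p> element, with or without attributes.
string[] paragraphs = doc.DocumentNode
    .Descendants("p")
    .Select(node => node.InnerText)
    .ToArray();

foreach (string text in paragraphs)
    Console.WriteLine(text);
```

The attribute on the tag no longer matters, since the parser exposes elements by name.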
That said, if you insist on a regex, try
(?<=<p[^<>]*>)(?:(?!</p>).)*
Explanation:
(?<=<p[^<>]*>) # Assert position right after a p tag
(?:(?!</p>).)* # Match any number of characters until the next </p>
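A short C# sketch of how this pattern could be applied to the sample string from the question. The variable names are illustrative, and the pattern is the one given above, unchanged.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Sample input from the question.
string html = "<p class='p1'>Para 1</p><p>Para 2</p><p class=\"p2\">Para 3</p><p>Para 4</p>";

// [^<>]* keeps the lookbehind inside a single opening tag;
// (?:(?!</p>).)* consumes characters until the closing tag begins.
string[] innerTexts = Regex.Matches(html, @"(?<=<p[^<>]*>)(?:(?!</p>).)*")
    .Cast<Match>()
    .Select(m => m.Value)
    .ToArray();

foreach (string text in innerTexts)
    Console.WriteLine(text);
```

Note that `<p[^<>]*>` would also accept other tags whose names start with p, such as `<pre>`, so this is only reasonable for input as simple as the sample.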
Upvotes: 5
Reputation: 11788
Have you tried using the following expression?
<p[\s\S]*?>(?<text_inside_p>[\s\S]*?)</p>
The group named text_inside_p will contain the desired text.
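A minimal C# sketch of reading that named group, using the sample string from the question. The variable names are illustrative. Unlike the lookaround versions, each match here spans the whole element, so the text has to be taken from the group rather than from `Match.Value`.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Sample input from the question.
string html = "<p class='p1'>Para 1</p><p>Para 2</p><p class=\"p2\">Para 3</p><p>Para 4</p>";

// Each match covers "<p ...>text</p>"; the named group holds only the text.
string[] groupTexts = Regex.Matches(html, @"<p[\s\S]*?>(?<text_inside_p>[\s\S]*?)</p>")
    .Cast<Match>()
    .Select(m => m.Groups["text_inside_p"].Value)
    .ToArray();

foreach (string text in groupTexts)
    Console.WriteLine(text);
```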
Upvotes: 1