Reputation: 4474
I'd like to use .NET Regex to match a bit of really broken HTML, and I am not sure how to do it.
I know that Regex is a poor tool for this job, but I only need to extract some basic text from a huge file that contains some really poorly formed HTML markup, and my problem seems like a piece of cake for someone good at Regex.
So, putting aside that it's HTML for a moment, let's say I have this:
<span class=comment>First block with <span class=nest>nested</span> text.</span>
<stuff>
<more-badly-formatted-tags>
<td - out of nowhere>
<span class=comment>Other block with <span class=nest>nested</span> text.</span>
I'd simply like to get the contents of all `span` tags, along with any nested `span` tags. For the example above, that would simply be:
First block with <span class=nest>nested</span> text.
Other block with <span class=nest>nested</span> text.
That's all I need, which is why I didn't want to get into HtmlAgilityPack at all.
What I've tried so far
Naive Regex: `@"<span class=comment>(?<comment>.*)</span>"`: this will greedily match everything between the first and last `span`.
Lazy Regex: `@"<span class=comment>(?<comment>.*?)</span>"`: this will stop at the first closing `span` and won't work with nested tags.
Balanced: `@"(?<tag>\<span\b[^\>]*\>)(?<comment>.*)(?<-tag>\</span\>)"`: but obviously I don't get the syntax, because this is not working.
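To make the difference between the first two attempts concrete, here is a minimal sketch that runs both patterns over the same input. The class name `GreedyVsLazy` and the helper `Comments` are made up for illustration, and `RegexOptions.Singleline` is an assumption added so that `.` also matches newlines.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class GreedyVsLazy
{
    // Greedy: ".*" runs to the LAST "</span>" in the input.
    public const string Greedy = @"<span class=comment>(?<comment>.*)</span>";

    // Lazy: ".*?" stops at the FIRST "</span>", even if it closes a nested span.
    public const string Lazy = @"<span class=comment>(?<comment>.*?)</span>";

    // Returns the "comment" group of every match of the given pattern.
    public static List<string> Comments(string pattern, string input)
    {
        var found = new List<string>();
        // Singleline lets '.' match newlines, since the input may span several lines.
        foreach (Match m in Regex.Matches(input, pattern, RegexOptions.Singleline))
        {
            found.Add(m.Groups["comment"].Value);
        }
        return found;
    }
}
```

On the sample above, `GreedyVsLazy.Comments(GreedyVsLazy.Greedy, html)` should return a single string that swallows everything from the first block through the second, while the lazy pattern should return two strings, each cut off just before the nested `</span>`.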
Can anyone help me with this?
[Update]
Note that there might be newlines between these `<span>` tags. Or, if you will, the whole string can be a single huge line.
Upvotes: 1
Views: 68
Reputation: 67898
I think this will get you what you want:
<span.*?>(.*)</span>
Upvotes: 0
Reputation: 4095
How about simply:
<span.*?>(.*)</span>
Working regex example:
Matches:
1. `First block with <span class=nest>nested</span> text.`
2. `Other block with <span class=nest>nested</span> text.`
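Note that the pattern above relies on each block sitting on its own line. If the whole input is one long line, or `.` is allowed to match newlines, the greedy `(.*)` would run to the last `</span>`. For that case, the balancing-group construct mentioned in the question could be sketched roughly as follows. This is only a sketch: the class name `CommentSpans`, the method `Extract`, and the group name `depth` are made up for illustration, and it assumes the outer tags are written exactly as `<span class=comment>`.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class CommentSpans
{
    private const string Pattern =
        @"<span\s+class=comment\s*>" +        // outer opening tag
        @"(?<comment>" +
            @"(?>" +
                @"(?<depth><span\b[^>]*>)" +  // nested opening span: push onto 'depth'
                @"|(?<-depth></span>)" +      // closing span: pop (fails when stack is empty)
                @"|(?!</?span\b)." +          // any character that does not start a span tag
            @")*" +
            @"(?(depth)(?!))" +               // reject the match if a nested span is still open
        @")" +
        @"</span>";                           // outer closing tag

    // Singleline lets '.' match newlines, so blocks may span several lines.
    private static readonly Regex CommentRegex =
        new Regex(Pattern, RegexOptions.Singleline);

    // Returns the inner content of every top-level comment span.
    public static List<string> Extract(string html)
    {
        var results = new List<string>();
        foreach (Match m in CommentRegex.Matches(html))
        {
            results.Add(m.Groups["comment"].Value);
        }
        return results;
    }
}
```

Calling `CommentSpans.Extract(html)` on the sample from the question should give the two expected strings, whether or not the blocks are separated by newlines. Stray tags such as `<td - out of nowhere>` are only ever skipped between matches or consumed as ordinary characters, so they should not disturb the nesting count.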
Upvotes: 1