Lou

Reputation: 4474

Matching content with nested tags

I'd like to use .NET Regex to match parts of some badly broken HTML, and I'm not sure how to do it.

I know that Regex is a poor tool for this job, but I only need to extract some basic text from a huge file that contains poorly formed HTML markup, and my problem seems like a piece of cake for someone good at Regex.

So, putting aside that it's HTML for a moment, let's say I have this:

<span class=comment>First block with <span class=nest>nested</span> text.</span>
<stuff>
<more-badly-formatted-tags>
<td - out of nowhere>
<span class=comment>Other block with <span class=nest>nested</span> text.</span>

I'd simply like to get all contents of span tags, along with any nested span tags. For the example above, that would simply be:

First block with <span class=nest>nested</span> text.
Other block with <span class=nest>nested</span> text.

That's everything I need, which is why I didn't want to get into HtmlAgilityPack at all.

What I've tried so far

  1. Naive Regex: @"<span class=comment>(?<comment>.*)</span>": this will greedily match everything between the first and last span.

  2. Lazy Regex: @"<span class=comment>(?<comment>.*?)</span>": this will stop at the first closing span and won't work with nested tags.

  3. Balanced: @"(?<tag>\<span\b[^\>]*\>)(?<comment>.*)(?<-tag>\</span\>)": but obviously I don't get the syntax because this is not working.
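
To make the difference between attempts 1 and 2 concrete, here is a minimal sketch that runs both against the sample joined onto a single line. The class and method names (`NaiveSpanDemo`, `Capture`, `Greedy`, `Lazy`) are made up for illustration, the named group is written as `(?<comment>...)`, and `RegexOptions.Singleline` is assumed so that `.` also matches newlines.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class NaiveSpanDemo
{
    // Sample from the question, joined onto one line (the "single huge line" case).
    public const string Sample =
        "<span class=comment>First block with <span class=nest>nested</span> text.</span>" +
        "<stuff>" +
        "<span class=comment>Other block with <span class=nest>nested</span> text.</span>";

    // Collects the "comment" group of every match of the given pattern.
    public static List<string> Capture(string pattern, string input)
    {
        var values = new List<string>();
        foreach (Match m in Regex.Matches(input, pattern, RegexOptions.Singleline))
            values.Add(m.Groups["comment"].Value);
        return values;
    }

    // Attempt 1: greedy .* runs to the last </span> in the input.
    public static List<string> Greedy(string input) =>
        Capture(@"<span class=comment>(?<comment>.*)</span>", input);

    // Attempt 2: lazy .*? stops at the first </span>, which closes the nested span.
    public static List<string> Lazy(string input) =>
        Capture(@"<span class=comment>(?<comment>.*?)</span>", input);
}
```

On `NaiveSpanDemo.Sample`, `Greedy` returns a single capture that runs from "First block" all the way to the final " text.", swallowing `<stuff>` and the second opening tag, while `Lazy` returns two captures that are each cut off right after `nested`.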

Can anyone help me with this?

[Update]

Note that there might be newlines between these <span> tags. Or, if you will, the whole string can be a single huge line.

Upvotes: 1

Views: 68

Answers (2)

Mike Perrenoud

Reputation: 67898

I think this will get you what you want:

<span.*?>(.*)</span>


Upvotes: 0

Bryan Elliott

Reputation: 4095

How about simply:

<span.*?>(.*)</span>

Working regex example:

http://regex101.com/r/bX3gU2

Matches:

1.  `First block with <span class=nest>nested</span> text.`

2.  `Other block with <span class=nest>nested</span> text.`
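
One caveat: because `(.*)` is greedy, the pattern above only yields these two matches while each block sits on its own line (by default `.` does not match a newline). If the whole input is a single line, or `RegexOptions.Singleline` is set, it will run from the first opening tag to the last `</span>`. For that case, a .NET balancing-group pattern is one possible approach. What follows is only a sketch under assumptions: the helper names (`SpanExtractor`, `ExtractComments`) and the group name `depth` are invented for illustration, and the opening tag is assumed to be written exactly as `<span class=comment>`.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class SpanExtractor
{
    // Hypothetical helper, not from the thread. IgnorePatternWhitespace lets the
    // pattern be laid out with comments; Singleline lets '.' match newlines.
    private static readonly Regex CommentSpan = new Regex(@"
        <span\s+class=comment>
        (?<comment>
            (?>
                (?:(?!</?span\b).)+          # any text that does not start a span tag
              | <span\b[^>]*> (?<depth>)     # nested opening tag: push onto 'depth'
              | </span>       (?<-depth>)    # nested closing tag: pop, fails if nothing to pop
            )*
        )
        (?(depth)(?!))                       # fail if a nested span is still open
        </span>",
        RegexOptions.Singleline | RegexOptions.IgnorePatternWhitespace);

    // Returns the inner content of every top-level comment span, nested spans included.
    public static List<string> ExtractComments(string html)
    {
        var results = new List<string>();
        foreach (Match m in CommentSpan.Matches(html))
            results.Add(m.Groups["comment"].Value);
        return results;
    }
}
```

Calling `SpanExtractor.ExtractComments(input)` on the question's sample, with or without line breaks between the blocks, should give the two strings listed above. The pop in the third alternative fails once the stack is empty, so the loop stops at the `</span>` that closes the outer comment span, and the unrelated broken tags between the blocks are never inspected.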

Upvotes: 1
