Gideon

Reputation: 18501

REGEX - Find td with specific class, including nested tables

I have to parse a piece of HTML. It looks a bit like this:

<table>
   <tr>
     <td class="blabla"> <table><tr><td><table><tr><td></td></tr></table></td></tr></table>
     </td>
   </tr>
  <tr>
     <td class="blabla"> <table><tr><td></td></tr></table>
     </td>
   </tr>
</table>

I need to extract each td with class blabla, but each of these cells could contain zero or more nested tables, each with many nested td elements. I want to get:

<td class="blabla"> ... many nested stuff ... </td>

Thanks

Upvotes: 0

Views: 1433

Answers (6)

Mike Caron

Reputation: 5784

You can't do this with regular expressions alone. Even with lookahead, the pattern would have to change dynamically: every <td> that opens after the one you want adds one more </td> that you have to skip before you reach the matching close.
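The counting described above is easy to do with an event-based parser, even though a plain regex can't do it. Here is a minimal sketch in Python using the standard library's html.parser (an assumption on my part; the question is about .NET, so treat this as an illustration of the idea, not a drop-in solution). The names TdCollector and extract_cells are hypothetical, the class attribute is compared as an exact string, and comments and character references inside a cell are not reproduced verbatim.

```python
from html.parser import HTMLParser

# Sample markup copied from the question.
SAMPLE = """<table>
   <tr>
     <td class="blabla"> <table><tr><td><table><tr><td></td></tr></table></td></tr></table>
     </td>
   </tr>
  <tr>
     <td class="blabla"> <table><tr><td></td></tr></table>
     </td>
   </tr>
</table>"""


class TdCollector(HTMLParser):
    """Collects the markup of every <td> with a given class, nested tables included."""

    def __init__(self, wanted_class):
        super().__init__()
        self.wanted_class = wanted_class
        self.depth = 0      # how many <td> are currently open inside a captured cell
        self.buffer = []    # pieces of markup for the cell being captured
        self.cells = []     # finished cells

    def handle_starttag(self, tag, attrs):
        if self.depth:
            # Already inside a wanted cell: every nested <td> raises the depth.
            if tag == "td":
                self.depth += 1
            self.buffer.append(self.get_starttag_text())
        elif tag == "td" and dict(attrs).get("class") == self.wanted_class:
            # Outermost wanted cell: start capturing.
            self.depth = 1
            self.buffer = [self.get_starttag_text()]

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied through unchanged.
        if self.depth:
            self.buffer.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if not self.depth:
            return
        self.buffer.append(f"</{tag}>")
        if tag == "td":
            self.depth -= 1
            if self.depth == 0:
                # The </td> that balances the outermost <td> closes the cell.
                self.cells.append("".join(self.buffer))

    def handle_data(self, data):
        if self.depth:
            self.buffer.append(data)


def extract_cells(html, wanted_class):
    collector = TdCollector(wanted_class)
    collector.feed(html)
    collector.close()
    return collector.cells


cells = extract_cells(SAMPLE, "blabla")
for cell in cells:
    print(cell)
```

The depth counter is exactly the piece of state a regular expression has no way to carry: it goes up on each nested <td> and the cell ends only when it returns to zero.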

Upvotes: 0

René

Reputation: 161

If you need to do extensive HTML parsing, I would recommend using the Html Agility Pack instead of regular expressions. HAP builds an XML-style document from an HTML page, so you can look for specific nodes using XPath.

Upvotes: 4

Welbog

Reputation: 60438

Don't try to parse HTML with regular expressions. You can't write an expression that will match what you want, because HTML isn't a regular language.

Use an HTML/XML parser in a library your language provides. System.Xml has a number of useful classes that will let you open your file and query it with XPath.

The XPath expression you're looking for is

//td[@class="someClass"]
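For illustration, here is a rough sketch of that query run against the markup from the question. It uses Python's xml.etree.ElementTree as a stand-in for System.Xml (my assumption, chosen only to keep the example short), and it only works because the sample happens to be well-formed XML; real-world HTML often isn't. ElementTree supports a limited XPath subset, so the query is written relative to the root as .//td[@class='blabla'].

```python
import xml.etree.ElementTree as ET

# Sample markup copied from the question; it is well-formed, so an XML parser accepts it.
SAMPLE = """<table>
   <tr>
     <td class="blabla"> <table><tr><td><table><tr><td></td></tr></table></td></tr></table>
     </td>
   </tr>
  <tr>
     <td class="blabla"> <table><tr><td></td></tr></table>
     </td>
   </tr>
</table>"""

root = ET.fromstring(SAMPLE)

# ".//" searches all descendants; the predicate keeps only td elements
# whose class attribute is exactly "blabla".
cells = root.findall(".//td[@class='blabla']")

# Serialize each matched cell back to markup, nested tables included.
# short_empty_elements=False keeps <td></td> instead of <td />;
# strip() drops the whitespace that follows each closing </td>.
serialized = [
    ET.tostring(cell, encoding="unicode", short_empty_elements=False).strip()
    for cell in cells
]

for markup in serialized:
    print(markup)
```

The nested td elements have no class attribute, so only the two outer cells match, and each one comes back with its whole subtree attached.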

Upvotes: 6

Xetius

Reputation: 46864

You would be looking for a regex similar to /<td\sclass=\"(.*?)\">/, but I do not know the way to do this in .net.

However, because HTML is so often malformed, regex is not a good candidate for parsing it. There are much better tools for that.
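To make the limitation concrete, here is a small sketch in Python (an assumption; the same patterns should behave the same way in other regex flavors). The CELL string is a shortened, hypothetical version of the question's markup with the text "inner" added so the cut-off point is visible. The pattern above only matches the opening tag, and the two obvious ways of extending it to the closing tag both pick the wrong </td>.

```python
import re

# Hypothetical, shortened cell based on the question's markup.
CELL = '<td class="blabla"> <table><tr><td>inner</td></tr></table> </td>'
TWO_CELLS = CELL + "\n" + CELL

# The pattern from the answer: matches the opening tag and captures the class name.
opening = re.compile(r'<td\sclass="(.*?)">')

# Naive extension 1: lazily consume up to the FIRST closing tag.
lazy = re.compile(r'<td\sclass="blabla">.*?</td>', re.DOTALL)

# Naive extension 2: greedily consume up to the LAST closing tag.
greedy = re.compile(r'<td\sclass="blabla">.*</td>', re.DOTALL)

class_name = opening.search(CELL).group(1)
lazy_match = lazy.search(CELL).group(0)
greedy_match = greedy.search(TWO_CELLS).group(0)

print(class_name)                 # the captured class name
print(lazy_match == CELL)         # False: stopped at the nested </td>
print(greedy_match == TWO_CELLS)  # True: swallowed both cells in one match
```

The lazy version ends at the nested cell's closing tag, and the greedy version runs across both cells, so neither returns one complete outer cell.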

As has been mentioned, XPath would be a good way to do this, using //td[@class="someClass"]. This gives you the td node. You can then get its contents and process them as required.

Upvotes: 0

Ratnesh Maurya

Reputation: 706

([tT][dD]\sclass=\"blabla\")

Upvotes: 0

rahul

Reputation: 187100

Why don't you use CSS selectors?

Upvotes: 1
