johnv

Reputation: 73

Regex to Parse HTML Tables

I am trying to remove the tables from an HTML file. Specifically, for the document below, I'd like to remove everything between the <TABLE ...> and </TABLE> tags, including the tags themselves. The document contains multiple tables with text in between.

The expression I came up with, <TABLE.*>\s*[\s|\S]*</TABLE>\s*, also removes the text between the tables. In fact, it removes everything from the first <TABLE> tag to the last </TABLE> tag. I would like to keep the text in between and remove only the tables. Any suggestion is greatly appreciated. Thanks.

====================

<TABLE STYLE=xxx, Font=yyy, etc>

table texts that should be DELETED...

</TABLE>


other texts that should be KEPT...


<TABLE STYLE=xxx, Font=yyy, etc>

table texts that should be DELETED...

</TABLE>

 ==========================================

Upvotes: 0

Views: 3057

Answers (2)

Camilo Martin

Reputation: 37898

Since I know you're not going to look at an HTML parser even if I tell you you really should, I'll just answer the question.

This matches only tables:

<table.*?>.*?</table>

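It requires two options: dotall, so that . also matches newlines, and ignoreCase, so that <table also matches <TABLE. In .NET these are RegexOptions.Singleline and RegexOptions.IgnoreCase.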
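As a rough illustration of the difference, here is a minimal sketch in Python (not .NET, which is an assumption of convenience; the regex syntax used here is the same). The names TABLE_RE, GREEDY_RE, strip_tables and the sample string are made up for this example. It applies the lazy pattern above with both options set, and runs the greedy expression from the question on the same input for comparison:

```python
import re

# Lazy pattern from this answer. re.DOTALL lets "." match newlines,
# re.IGNORECASE lets "<table" match "<TABLE".
TABLE_RE = re.compile(r"<table.*?>.*?</table>", re.DOTALL | re.IGNORECASE)

# Greedy pattern from the question, kept as-is for comparison.
GREEDY_RE = re.compile(r"<TABLE.*>\s*[\s|\S]*</TABLE>\s*")


def strip_tables(html: str) -> str:
    """Remove every non-nested <table>...</table> block."""
    return TABLE_RE.sub("", html)


# Hypothetical sample modelled on the document in the question.
sample = (
    "<TABLE STYLE=xxx>\ntable text to delete\n</TABLE>\n"
    "other text to keep\n"
    "<TABLE STYLE=xxx>\nmore table text\n</TABLE>\n"
)

lazy_result = strip_tables(sample)        # text between the tables survives
greedy_result = GREEDY_RE.sub("", sample)  # everything is swallowed
```

The lazy quantifiers stop at the first closing tag they can reach, so each table is matched separately; the greedy version runs on to the last </TABLE> in the input.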

You can try it here: http://gskinner.com/RegExr/


Now do consider using the HTML Agility Pack suggested by Lucero, OK?

Edit: maybe this was what you meant, sorry:


Upvotes: 2

Lucero

Reputation: 60190

The answer is to use an HTML or SGML parser; there are several available for .NET:

http://htmlagilitypack.codeplex.com/

SGML parser .NET recommendations

If you absolutely want to use regular expressions, familiarize yourself with balancing groups; otherwise nested tables will break. It's not easy, and it may perform much slower than a regular SGML parser. Be warned, though: judging by your expression, I assume that you are a regex newbie (hint: avoid greedy . matches at any cost), so this is probably not yet your cup of tea.
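To make the nested-table problem concrete, here is a minimal sketch in Python using the standard library's html.parser as a stand-in for the .NET parsers mentioned above (that substitution, and the names TableStripper and strip_tables, are assumptions for illustration only). It tracks table nesting depth and keeps only what lies outside every table. It is deliberately simplified: comments and doctype declarations are dropped, and tags outside tables are re-emitted in normalized form.

```python
import re
from html.parser import HTMLParser


class TableStripper(HTMLParser):
    """Collects everything that is not inside a <table> element."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.depth = 0   # current <table> nesting depth
        self.out = []    # pieces of output kept so far

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.depth += 1
        elif self.depth == 0:
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/>; keep them only outside tables.
        if self.depth == 0:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "table":
            self.depth = max(0, self.depth - 1)
        elif self.depth == 0:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self.depth == 0:
            self.out.append(data)

    def handle_entityref(self, name):
        if self.depth == 0:
            self.out.append(f"&{name};")

    def handle_charref(self, name):
        if self.depth == 0:
            self.out.append(f"&#{name};")


def strip_tables(html: str) -> str:
    parser = TableStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


# Hypothetical input with one table nested inside another.
nested = ("before<table><tr><td><table><tr><td>inner</td></tr></table>"
          "outer</td></tr></table>after")

parsed = strip_tables(nested)
# A simple lazy regex stops at the first </table> and leaves debris behind.
lazy = re.sub(r"<table.*?>.*?</table>", "", nested, flags=re.S | re.I)
```

On this input the depth counter removes the whole outer table, while the lazy regex ends its match at the inner closing tag and leaves the rest of the outer table in place.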

Upvotes: 2
