Mike108

Reputation: 2147

How do I parse HTML using regular expressions in C#?

For example, given this HTML:

<s2> t1 </s2>  <img src='1.gif' />  <span> span1 <span/>

I am trying to obtain

1. <s2>
2. t1
3. </s2>
4. <img src='1.gif' />
5. <span>
6. span1
7. <span/>

How do I do this using regular expressions in C#?

In my case, the HTML input is not well-formed XML (it is not XHTML), so I cannot use an XML parser for this.

Upvotes: 0

Views: 2303

Answers (5)

Mike108

Reputation: 2147

I used this regex in C#, and it works. Thanks for all your answers.

<([^<]*)>|([^<]*)
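
For anyone who wants to try it, here is a minimal sketch of how this pattern could be applied with Regex.Matches. The class and method names (TagTokenizer, Tokenize) are made up for illustration. The second alternative can match whitespace or an empty string, so the sketch trims each match and skips the empty ones. It only splits the text on angle brackets; it does not really understand HTML, and it will mis-split input such as a literal > inside text content.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class TagTokenizer // hypothetical name, for illustration only
{
    // Pattern from the answer: either one <...> tag or a run of non-'<' text.
    private static readonly Regex Token = new Regex(@"<([^<]*)>|([^<]*)");

    public static List<string> Tokenize(string html)
    {
        var tokens = new List<string>();
        foreach (Match m in Token.Matches(html))
        {
            // The second alternative also matches whitespace-only runs and the
            // empty string, so trim each match and drop whatever ends up empty.
            string value = m.Value.Trim();
            if (value.Length > 0)
            {
                tokens.Add(value);
            }
        }
        return tokens;
    }

    static void Main()
    {
        string html = "<s2> t1 </s2>  <img src='1.gif' />  <span> span1 <span/>";
        foreach (string token in Tokenize(html))
        {
            Console.WriteLine(token);
        }
    }
}
```

On the sample input from the question this should give the seven pieces listed there, one per line.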

Upvotes: 0

Jörg W Mittag

Reputation: 369623

This has already been answered literally dozens of times, but it bears repeating: regular expressions can only parse regular languages, that's why they are called regular expressions. HTML is not a regular language (as probably every college student in the last decade has proved at least once), and therefore cannot be parsed by regular expressions.
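
A small sketch of what goes wrong in practice, assuming a naive pattern that pairs an opening tag with the next closing tag. The pattern and the input are illustrative, not taken from the question. Because the pattern cannot count nesting depth, it closes the outer element at the inner closing tag.

```csharp
using System;
using System.Text.RegularExpressions;

static class NestingDemo // hypothetical name, for illustration only
{
    static void Main()
    {
        // One <div> nested inside another.
        string html = "<div>a<div>b</div>c</div>";

        // Naive attempt: an opening tag, as little as possible, then a closing tag.
        Match m = Regex.Match(html, @"<div>(.*?)</div>");

        // The match stops at the first </div>, which belongs to the inner element.
        Console.WriteLine(m.Value); // prints "<div>a<div>b</div>"
    }
}
```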

Upvotes: 4

junmats

Reputation: 1924

You might want to simply use string functions, treating < and > as the delimiters for parsing.
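
A rough sketch of that idea, assuming plain IndexOf and Substring calls are enough for the input at hand. The names StringScanner, Split and AddIfNotBlank are made up for illustration. It treats everything from a < to the next > as one tag, so it will break on a > inside an attribute value.

```csharp
using System;
using System.Collections.Generic;

static class StringScanner // hypothetical name, for illustration only
{
    public static List<string> Split(string html)
    {
        var parts = new List<string>();
        int pos = 0;
        while (pos < html.Length)
        {
            int open = html.IndexOf('<', pos);
            if (open < 0)
            {
                // No more tags: the rest is text.
                AddIfNotBlank(parts, html.Substring(pos));
                break;
            }

            // Text between the previous tag and this one.
            AddIfNotBlank(parts, html.Substring(pos, open - pos));

            int close = html.IndexOf('>', open);
            if (close < 0)
            {
                // Unterminated tag: keep the remainder as-is.
                AddIfNotBlank(parts, html.Substring(open));
                break;
            }

            parts.Add(html.Substring(open, close - open + 1));
            pos = close + 1;
        }
        return parts;
    }

    private static void AddIfNotBlank(List<string> parts, string text)
    {
        string trimmed = text.Trim();
        if (trimmed.Length > 0)
        {
            parts.Add(trimmed);
        }
    }

    static void Main()
    {
        string html = "<s2> t1 </s2>  <img src='1.gif' />  <span> span1 <span/>";
        foreach (string part in Split(html))
        {
            Console.WriteLine(part);
        }
    }
}
```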

Upvotes: -3

nickytonline

Reputation: 6981

You might want to try the Html Agility Pack, http://www.codeplex.com/htmlagilitypack. It even handles malformed HTML.
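
A minimal sketch of what that could look like, assuming the HtmlAgilityPack package is referenced in the project (it is not part of the .NET base library). The class name AgilityPackDemo is made up. Note that the library builds a node tree, so you get element and text nodes rather than the literal open and close tag strings asked for in the question.

```csharp
using System;
using HtmlAgilityPack; // external package, must be added to the project

static class AgilityPackDemo // hypothetical name, for illustration only
{
    static void Main()
    {
        string html = "<s2> t1 </s2>  <img src='1.gif' />  <span> span1 <span/>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Walk every node in the parsed tree in document order.
        foreach (HtmlNode node in doc.DocumentNode.Descendants())
        {
            if (node.NodeType == HtmlNodeType.Element)
            {
                Console.WriteLine("element: " + node.Name);
            }
            else if (node.NodeType == HtmlNodeType.Text)
            {
                // Skip whitespace-only text between tags.
                string text = node.InnerText.Trim();
                if (text.Length > 0)
                {
                    Console.WriteLine("text: " + text);
                }
            }
        }
    }
}
```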

Upvotes: 3

bobbymcr

Reputation: 24177

Regular expressions are a very poor way to parse HTML. If you can guarantee that your input will be well-formed XML (i.e. XHTML), you can use XmlReader to read the elements and then print them out however you like.
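
A sketch of the XmlReader route, assuming the input really is well-formed. The sample string below is a corrected variant of the one in the question (the last tag is a proper closing tag), and the class name XmlReaderDemo is made up. ConformanceLevel.Fragment is set because the input has several top-level elements rather than a single root.

```csharp
using System;
using System.IO;
using System.Xml;

static class XmlReaderDemo // hypothetical name, for illustration only
{
    static void Main()
    {
        // Assumption: well-formed variant of the question's input.
        string xml = "<s2> t1 </s2>  <img src='1.gif' />  <span> span1 </span>";

        var settings = new XmlReaderSettings
        {
            // Allow more than one top-level element.
            ConformanceLevel = ConformanceLevel.Fragment
        };

        using (XmlReader reader = XmlReader.Create(new StringReader(xml), settings))
        {
            while (reader.Read())
            {
                switch (reader.NodeType)
                {
                    case XmlNodeType.Element:
                        // Only the element name is printed here; attributes are omitted.
                        Console.WriteLine(reader.IsEmptyElement
                            ? "<" + reader.Name + " />"
                            : "<" + reader.Name + ">");
                        break;
                    case XmlNodeType.Text:
                        Console.WriteLine(reader.Value.Trim());
                        break;
                    case XmlNodeType.EndElement:
                        Console.WriteLine("</" + reader.Name + ">");
                        break;
                    // Whitespace-only nodes and other node types are ignored.
                }
            }
        }
    }
}
```

With the malformed input from the question, the reader would throw an XmlException instead, which is why this only helps when the markup is guaranteed to be XHTML.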

Upvotes: 6
