Reputation: 2147
How do I parse HTML using regular expressions in C#?
For example, given the HTML code:
<s2> t1 </s2> <img src='1.gif' /> <span> span1 <span/>
I am trying to obtain:
1. <s2>
2. t1
3. </s2>
4. <img src='1.gif' />
5. <span>
6. span1
7. <span/>
How do I do this using regular expressions in C#?
In my case, the HTML input is not well-formed XML (as XHTML would be), so I cannot use an XML parser for this.
Upvotes: 0
Views: 2303
Reputation: 2147
I used this regex in C#, and it works for my input. Thanks for all your answers.
<([^<]*)>|([^<]*)
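A minimal sketch of how that pattern might be applied with `Regex.Matches`. The class and method names (`RegexTokenizer`, `Tokenize`) are made up for illustration. Since the second alternative can match an empty string, blank and whitespace-only matches are skipped.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
static class RegexTokenizer
{
    // Pattern from the post: either a tag "<...>" or a run of non-"<" text.
    private static readonly Regex TokenPattern = new Regex(@"<([^<]*)>|([^<]*)");

    public static List<string> Tokenize(string html)
    {
        var tokens = new List<string>();
        foreach (Match m in TokenPattern.Matches(html))
        {
            string token = m.Value.Trim();
            // The second alternative can match the empty string, so skip blanks.
            if (token.Length > 0)
                tokens.Add(token);
        }
        return tokens;
    }
}
```

On the sample input from the question this yields the seven tokens listed there. It is only a tokenizer for simple input: a `>` inside an attribute value, a stray `<` in text, comments, or script blocks will likely produce wrong tokens.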
Upvotes: 0
Reputation: 369623
This has already been answered literally dozens of times, but it bears repeating: regular expressions can only parse regular languages, which is why they are called regular expressions. HTML is not a regular language (as probably every college student in the last decade has proved at least once), and therefore it cannot be parsed by regular expressions.
Upvotes: 4
Reputation: 1924
You might want to simply use string functions, treating < and > as the delimiters for parsing.
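A rough sketch of that idea, using only `IndexOf` and `Substring`. The names `DelimiterTokenizer` and `Tokenize` are hypothetical, and the code assumes that every `<` starts a tag and the next `>` ends it, which does not hold for real-world HTML in general.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of any library.
static class DelimiterTokenizer
{
    public static List<string> Tokenize(string html)
    {
        var tokens = new List<string>();
        int pos = 0;
        while (pos < html.Length)
        {
            int open = html.IndexOf('<', pos);
            // Text runs up to the next '<', or to the end of the input.
            int textEnd = open < 0 ? html.Length : open;
            string text = html.Substring(pos, textEnd - pos).Trim();
            if (text.Length > 0)
                tokens.Add(text);
            if (open < 0)
                break;

            int close = html.IndexOf('>', open);
            if (close < 0)
            {
                // Unterminated tag: keep the remainder as a single token.
                tokens.Add(html.Substring(open).Trim());
                break;
            }
            // The tag itself, including both delimiters.
            tokens.Add(html.Substring(open, close - open + 1));
            pos = close + 1;
        }
        return tokens;
    }
}
```

For the sample input in the question this produces the same seven tokens, but it shares the weaknesses of the regex approach: attribute values containing `>`, comments, and script content are not handled.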
Upvotes: -3
Reputation: 6981
You might want to try the Html Agility Pack, http://www.codeplex.com/htmlagilitypack. It even handles malformed HTML.
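A small sketch of what that could look like, assuming the `HtmlAgilityPack` NuGet package is referenced (it is not part of the base class library). The wrapper names (`AgilityPackExample`, `ElementNames`) are made up for illustration. Note that the library builds a node tree rather than a flat token list, so this only walks the tree and collects element names.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // external NuGet package, assumed to be installed

// Hypothetical wrapper for illustration.
static class AgilityPackExample
{
    public static List<string> ElementNames(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // tolerant of malformed markup

        var names = new List<string>();
        foreach (HtmlNode node in doc.DocumentNode.Descendants())
        {
            // Descendants() also yields text and comment nodes; keep elements only.
            if (node.NodeType == HtmlNodeType.Element)
                names.Add(node.Name);
        }
        return names;
    }
}
```

Text content is available through `InnerText` and the original markup through `OuterHtml` on each node, should the flat list from the question be needed.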
Upvotes: 3
Reputation: 24177
Regular expressions are a very poor way to parse HTML. If you can guarantee that your input will be well-formed XML (i.e. XHTML), you can use XmlReader to read the elements and then print them out however you like.
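As a sketch of the XmlReader route, assuming the input really is well-formed (so the sample from the question would need `</span>` instead of `<span/>` at the end). The names `XhtmlTokens` and `Read` are hypothetical, and attributes are dropped here to keep the example short.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

// Hypothetical helper, not part of any library.
static class XhtmlTokens
{
    public static List<string> Read(string xhtml)
    {
        // Fragment conformance allows several top-level elements.
        var settings = new XmlReaderSettings { ConformanceLevel = ConformanceLevel.Fragment };
        var tokens = new List<string>();
        using (var reader = XmlReader.Create(new StringReader(xhtml), settings))
        {
            while (reader.Read())
            {
                switch (reader.NodeType)
                {
                    case XmlNodeType.Element:
                        // Attributes are omitted in this sketch.
                        tokens.Add(reader.IsEmptyElement
                            ? "<" + reader.Name + " />"
                            : "<" + reader.Name + ">");
                        break;
                    case XmlNodeType.Text:
                        string text = reader.Value.Trim();
                        if (text.Length > 0)
                            tokens.Add(text);
                        break;
                    case XmlNodeType.EndElement:
                        tokens.Add("</" + reader.Name + ">");
                        break;
                }
            }
        }
        return tokens;
    }
}
```

Malformed input makes `reader.Read()` throw an `XmlException`, which is why this approach only fits the well-formed case.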
Upvotes: 6