Kiril Stanoev

Reputation: 1865

Suggestion Needed: Best way of parsing HTML in C#

My question: what is the best way to extract specific information from an HTML page? What I currently do is the following:

  1. Download the page using WebClient

  2. Convert the received data to string using UTF8Encoding

  3. Convert the string to XML

  4. Using Xml related classes from the .NET Framework extract the desired data

That is my current approach in summary. Is anyone aware of another method, something faster or easier?

Best Regards, Kiril

PS: I have heard about a testing framework called WatiN that allows you to do something similar, but I haven't researched it much.

Upvotes: 0

Views: 559

Answers (4)

samjudson

Reputation: 56853

I believe this could be simplified slightly by using the WebClient.DownloadString method, which downloads the page and decodes it to a string in one call.
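A minimal sketch of that simplification, assuming the page is served as UTF-8. The URL is a placeholder and the class name is only for illustration:

```csharp
using System;
using System.Net;
using System.Text;

class DownloadStringSketch
{
    static void Main()
    {
        // Placeholder address; substitute the page you actually want.
        string url = "http://example.com/";

        using (var client = new WebClient())
        {
            // DownloadString decodes the response with client.Encoding,
            // so set it first. UTF-8 is an assumption about the target page.
            client.Encoding = Encoding.UTF8;

            // One call replaces the separate download and decode steps.
            string html = client.DownloadString(url);
            Console.WriteLine("Downloaded {0} characters", html.Length);
        }
    }
}
```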

See other answers for details on the parsing, as I haven't tried the HTML Agility Pack.

Upvotes: 0

Kirschstein

Reputation: 14868

For your parsing needs I recommend the HTML Agility Pack.

For actually retrieving the HTML, use the WebRequest class.
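A rough sketch of fetching a page with WebRequest, under the assumption that the response is UTF-8 text. The FetchHtml helper and the URL are hypothetical names used only for this example:

```csharp
using System;
using System.IO;
using System.Net;
using System.Text;

class WebRequestSketch
{
    // Hypothetical helper: returns the response body of `url` as a string.
    static string FetchHtml(string url)
    {
        WebRequest request = WebRequest.Create(url);

        // Dispose the response, stream, and reader once the body is read.
        using (WebResponse response = request.GetResponse())
        using (Stream stream = response.GetResponseStream())
        using (var reader = new StreamReader(stream, Encoding.UTF8)) // UTF-8 is assumed
        {
            return reader.ReadToEnd();
        }
    }

    static void Main()
    {
        // Placeholder address; substitute the page you actually want.
        string html = FetchHtml("http://example.com/");
        Console.WriteLine("Downloaded {0} characters", html.Length);
    }
}
```

The resulting string can then be handed to whichever parser you choose.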

Upvotes: 2

Rex M

Reputation: 144112

It sounds like you've figured out how to fetch the page data (that's the simplest part).

For the rest, the best managed library I've used for this type of task is the HTML Agility Pack. It's open source and very mature, written entirely in .NET. It handles malformed HTML and can do what you need in two different ways:

  • Natively supports XPath and XML-like querying against the HTML DOM. It is designed to mimic .NET's XML library, so anything you can do against XML with .NET, you can do against HTML with this.

  • Supports producing valid XML from the HTML, so you can use any XML tools.
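To illustrate the first option, here is a minimal sketch of an XPath query with the HTML Agility Pack. It assumes the HtmlAgilityPack assembly is referenced; the sample markup is made up and deliberately not well-formed:

```csharp
using System;
using HtmlAgilityPack;

class AgilityPackSketch
{
    static void Main()
    {
        // Made-up sample with an unclosed <p> and a bare <br>,
        // the kind of markup a strict XML parser would reject.
        string html = "<html><body><p>Links:<a href='/a.html'>First</a><br>"
                    + "<a href='/b.html'>Second</a></body></html>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // XPath query for every anchor that has an href attribute.
        HtmlNodeCollection links = doc.DocumentNode.SelectNodes("//a[@href]");

        // SelectNodes returns null, not an empty collection, when nothing matches.
        if (links != null)
        {
            foreach (HtmlNode link in links)
            {
                string href = link.GetAttributeValue("href", "");
                Console.WriteLine(href + " -> " + link.InnerText);
            }
        }
    }
}
```

In practice the string passed to LoadHtml would be the page you downloaded rather than a literal.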

Upvotes: 5

alexmac

Reputation: 4201

Unless you are working with perfectly formed XHTML, wouldn't regular expressions be more suitable for parsing the HTML?
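As a rough sketch of what that could look like, the snippet below pulls href values out of anchor tags with System.Text.RegularExpressions. The sample markup and the pattern are illustrative assumptions; a pattern like this only covers simple, quoted attributes and will miss or mis-match more unusual markup:

```csharp
using System;
using System.Text.RegularExpressions;

class RegexLinkSketch
{
    static void Main()
    {
        // Made-up sample markup; a real page would be downloaded first.
        string html = "<p><a href=\"/a.html\">First</a> <a href='/b.html'>Second</a></p>";

        // Capture the quoted href value of each <a ...> tag.
        // Handles single or double quotes; unquoted values are not matched.
        var hrefPattern = new Regex(
            "<a\\s[^>]*?href\\s*=\\s*[\"']([^\"']*)[\"']",
            RegexOptions.IgnoreCase);

        foreach (Match match in hrefPattern.Matches(html))
        {
            // Group 1 holds the text between the quotes.
            Console.WriteLine(match.Groups[1].Value);
        }
    }
}
```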

WatiN allows you to script button clicks, script calls, etc. on a web page through IE (I'm not sure whether it can use other browsers). I don't think this will accomplish what you are looking for.

Upvotes: 0
