Reputation: 1865
My question is: which is the best way to extract certain information from an HTML page? What I currently do is the following:
Download the page using WebClient
Convert the received data to string using UTF8Encoding
Convert the string to XML
Using XML-related classes from the .NET Framework, extract the desired data
This is what I currently do, in summarized form. Is anyone aware of another method? Something that could be faster or easier?
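In code, my current approach looks roughly like this (a minimal sketch; the URL and XPath expression are placeholders, and the XML step only works when the page happens to be well-formed):

    using System;
    using System.Net;
    using System.Text;
    using System.Xml;

    class Scraper
    {
        static void Main()
        {
            using (var client = new WebClient())
            {
                // Step 1: download the raw bytes.
                byte[] raw = client.DownloadData("http://example.com/page.html");

                // Step 2: decode to a string with UTF8Encoding.
                string html = new UTF8Encoding().GetString(raw);

                // Steps 3-4: parse as XML and query the result; this throws
                // unless the page is valid XML/XHTML.
                var doc = new XmlDocument();
                doc.LoadXml(html);
                XmlNode node = doc.SelectSingleNode("//title");
                Console.WriteLine(node != null ? node.InnerText : "not found");
            }
        }
    }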
Best Regards, Kiril
PS: I have heard about a testing framework called Watin that allows you to do something similar, but I haven't researched it much.
Upvotes: 0
Views: 559
Reputation: 56853
This could be simplified slightly by using the WebClient.DownloadString method, I believe.
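A minimal sketch (the URL is a placeholder; setting Encoding mirrors your manual UTF8Encoding step):

    using System.Net;
    using System.Text;

    using (var client = new WebClient())
    {
        // DownloadString fetches the resource and decodes it to a string
        // in one call, replacing the separate download + decode steps.
        client.Encoding = Encoding.UTF8;
        string html = client.DownloadString("http://example.com/page.html");
    }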
See other answers for details on the parsing, as I haven't tried the HTML Agility Pack.
Upvotes: 0
Reputation: 14868
For your parsing needs, I recommend the HTML Agility Pack.
For actually retrieving the HTML, use the WebRequest class.
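For instance, a sketch of the retrieval with WebRequest (the URL is a placeholder):

    using System.IO;
    using System.Net;

    // Create the request and read the response body as a string.
    WebRequest request = WebRequest.Create("http://example.com/page.html");
    using (WebResponse response = request.GetResponse())
    using (var reader = new StreamReader(response.GetResponseStream()))
    {
        string html = reader.ReadToEnd();
    }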
Upvotes: 2
Reputation: 144112
It sounds like you've figured out how to fetch the page data (that's the simplest part).
For the rest, the best managed library I've used for this type of task is the HTML Agility Pack. It's open source and very mature, written entirely in .NET. It handles malformed HTML and can do what you need in two different ways:
Natively supports XPath and XML-like querying against the HTML DOM. It is designed to mimic .NET's XML library, so anything you can do against XML with .NET, you can do against HTML with this (see the sketch after this list).
Supports producing valid XML from the HTML, so you can use any XML tools.
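A minimal sketch of the XPath route (assumes the HtmlAgilityPack assembly is referenced; the URL and XPath expression are placeholders):

    using System;
    using HtmlAgilityPack;

    class Example
    {
        static void Main()
        {
            // HtmlWeb fetches and parses the page, tolerating malformed HTML.
            var web = new HtmlWeb();
            HtmlDocument doc = web.Load("http://example.com/page.html");

            // Query the HTML DOM with XPath, much as you would an XmlDocument.
            // SelectNodes returns null when nothing matches, so check first.
            var links = doc.DocumentNode.SelectNodes("//a[@href]");
            if (links != null)
            {
                foreach (HtmlNode link in links)
                    Console.WriteLine(link.GetAttributeValue("href", string.Empty));
            }
        }
    }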
Upvotes: 5
Reputation: 4201
Unless you are working with perfectly formed XHTML, wouldn't regular expressions be more suitable for parsing the HTML?
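For illustration only, a sketch of a regex extraction (the pattern and input are hypothetical, and this breaks easily on real-world markup):

    using System;
    using System.Text.RegularExpressions;

    string html = "<html><head><title>Example</title></head></html>";

    // Capture the text between the <title> tags; fragile on odd or nested markup.
    Match m = Regex.Match(html, @"<title>\s*(.*?)\s*</title>",
                          RegexOptions.IgnoreCase | RegexOptions.Singleline);
    if (m.Success)
        Console.WriteLine(m.Groups[1].Value); // prints "Example"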
Watin allows you to script button clicks, script calls, etc. on a web page through IE (I'm not sure whether it can use other browsers). I don't think it will accomplish what you are looking for.
Upvotes: 0