Reputation: 3141
I'm working on a C# app. What is the best way to scrape source code from a webpage?
Right now I just view the page source in my browser (Chrome), copy and paste it into a text file, and feed that into my parser.
I was thinking I'd first add a textbox to my application where I could paste a URL. The application would then pull that page's source code and pass it to my parser.
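To make the plan concrete, here is a rough sketch of the flow I have in mind. FetchSource is the piece I'm asking about, and ParseHtml is just a stand-in for my existing parser:

using System;

class ScraperApp
{
    static void Main()
    {
        string url = "http://example.com";   // would come from the textbox in my app

        // FetchSource is the part I'm asking about: given a URL, return the page source.
        string source = FetchSource(url);

        // Stand-in for my existing parsing code.
        ParseHtml(source);
    }

    static string FetchSource(string url)
    {
        throw new NotImplementedException("This is the part I need help with.");
    }

    static void ParseHtml(string source)
    {
        Console.WriteLine(source.Length);
    }
}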
Upvotes: 0
Views: 1202
Reputation: 4864
I'd consider HtmlAgilityPack. You can download a page like this:
using System.Net;        // for WebClient
using HtmlAgilityPack;   // for HtmlDocument
HtmlDocument document = new HtmlDocument();
document.LoadHtml(new WebClient().DownloadString("http://www.bing.com"));
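One note if you're targeting newer .NET: WebClient is marked obsolete there, so the same download is usually done with HttpClient instead. A minimal equivalent sketch:

using System.Net.Http;
using System.Threading.Tasks;
using HtmlAgilityPack;

class Downloader
{
    static async Task Main()
    {
        // HttpClient is the recommended replacement for WebClient on modern .NET.
        using var client = new HttpClient();
        string html = await client.GetStringAsync("http://www.bing.com");

        var document = new HtmlDocument();
        document.LoadHtml(html);
    }
}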
If you're also looking for a good parser, I've had good experience with ScrapySharp, which adds extension methods to HtmlAgilityPack's nodes so you can select elements with CSS selectors, like you would in jQuery. For example:
using ScrapySharp.Extensions;   // brings CssSelect into scope
var pdfLinks = document.DocumentNode.CssSelect(".sessions .main-head-row td.download a.text-pdf");
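The selector above is obviously specific to one page, so substitute whatever matches your own markup. Once you have the nodes, pulling data out is plain HtmlAgilityPack; continuing the snippet above (a sketch using the pdfLinks variable assigned there):

// Iterate the selected anchors and read their href attributes.
// GetAttributeValue returns the fallback (second argument) if the attribute is missing.
foreach (var link in pdfLinks)
{
    string href = link.GetAttributeValue("href", string.Empty);
    Console.WriteLine($"{link.InnerText.Trim()} -> {href}");
}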
Upvotes: 2