Reputation: 13212
I am currently trying to do a screen scrape using the following code:
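// Requires: using System.IO; using System.Net; using System.Text;
HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
string s;
using (HttpWebResponse theResponse = (HttpWebResponse)request.GetResponse())
using (StreamReader reader = new StreamReader(theResponse.GetResponseStream(), Encoding.UTF8))
{
    // Raw markup as sent by the server; no page scripts have run at this point.
    s = reader.ReadToEnd();
}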
However, the data I am interested in (an HTML table) is not part of the result. When I right-click the page and choose View Source, I do not see the HTML table there either. I do see it in the DOM when I inspect the page with Firebug.
It doesn't seem to be loaded via ajax either.
So, is there another way, using C#, to get the DOM as it appears in the developer tools, rather than the View Source result?
Unfortunately, this page is not publicly available so I can't paste the URL.
Upvotes: 2
Views: 524
Reputation: 571
Have you used Fiddler or Ethereal (now Wireshark) to see which URLs the page connects to in the background? If the HTML table appears in the response from one of those URLs, you can scrape the data from that URL directly. Which URL/table are you trying to parse?
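Once a candidate background URL shows up in the capture, a quick check along these lines can confirm whether its response carries the table. This is only a minimal sketch: `candidateUrl`, `BackgroundResponseCheck`, and `ContainsTable` are hypothetical names, the substring test is deliberately crude, and a non-public page will likely also need the cookies or headers seen in the captured request.

```csharp
using System;
using System.IO;
using System.Net;
using System.Text;

static class BackgroundResponseCheck
{
    // Hypothetical helper: crude, case-insensitive check for an opening <table tag.
    public static bool ContainsTable(string html)
    {
        return html != null
            && html.IndexOf("<table", StringComparison.OrdinalIgnoreCase) >= 0;
    }

    // Downloads the body of a URL observed in the network capture.
    public static string Download(string candidateUrl)
    {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(candidateUrl);
        using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
        using (StreamReader reader = new StreamReader(response.GetResponseStream(), Encoding.UTF8))
        {
            return reader.ReadToEnd();
        }
    }
}
```

Calling `BackgroundResponseCheck.ContainsTable(BackgroundResponseCheck.Download(candidateUrl))` for each URL in the capture should narrow down which request, if any, returns the table markup.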
Upvotes: 0
Reputation: 1038930
"It doesn't seem to be loaded via ajax either."
You don't need AJAX to add data to the DOM dynamically. Plain JavaScript that runs when the page loads can build the table without making any further request.
To scrape such a page you need a scraper that executes JavaScript. The WebBrowser control in WinForms does that. It lets you load a web page and explore the resulting DOM, just as you do in Firebug (except that the snapshot comes from IE, because WebBrowser is just a wrapper around IE).
However, the WebBrowser control is not designed to be used in a multithreaded environment (such as a web application), so in that setting you will have to use a third-party library for the scraping task.
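For a desktop or console program, where the single-thread restriction is not an obstacle, the WebBrowser approach might look roughly like the sketch below. The URL passed on the command line is a placeholder, and the code assumes the table exists by the time `DocumentCompleted` fires for the top-level page; content added later by timers or delayed scripts would need extra waiting.

```csharp
using System;
using System.Windows.Forms;

static class Program
{
    [STAThread] // the WebBrowser control requires a single-threaded apartment
    static void Main(string[] args)
    {
        // Placeholder: pass the real page URL as the first argument.
        string url = args.Length > 0 ? args[0] : "about:blank";

        using (WebBrowser browser = new WebBrowser())
        {
            browser.ScriptErrorsSuppressed = true;
            browser.DocumentCompleted += (sender, e) =>
            {
                // The event also fires for frames; wait for the top-level document.
                if (e.Url != browser.Url) return;

                // Read tables from the live DOM, after page scripts have run.
                HtmlElementCollection tables =
                    browser.Document.GetElementsByTagName("table");
                foreach (HtmlElement table in tables)
                {
                    Console.WriteLine(table.OuterHtml);
                }
                Application.ExitThread(); // end the message loop started below
            };

            browser.Navigate(url);
            Application.Run(); // pump messages until ExitThread is called
        }
    }
}
```

The project needs a reference to System.Windows.Forms. Unlike the `HttpWebRequest` version, this reads the markup from the DOM after scripts have executed, which is why the dynamically built table can appear here.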
Upvotes: 2