Reputation: 229
I need to get information from a couple of web sites, for example this site. What would be the best way to get all the links from a page so that the information can be extracted? Sometimes I need to click a link to reach further links inside it. I tried WatiN, and I tried doing the same from within Excel 2007 with the Web Data option. Could you please suggest a better approach that I am not aware of?
Upvotes: 1
Views: 688
Reputation: 10747
NCrawler might be very useful for deep-level crawling. You can also set MaxCrawlDepth to limit how many levels of links it follows.
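A minimal sketch of what that might look like, written from memory of NCrawler's pipeline API; treat the exact class and property names (HtmlDocumentProcessor, MaximumCrawlDepth, the IPipelineStep signature) as assumptions and check them against the version you install:

```csharp
using System;
using NCrawler;
using NCrawler.HtmlProcessor;
using NCrawler.Interfaces;

// Custom pipeline step that simply dumps every URL the crawler visits.
public class LinkDumperStep : IPipelineStep
{
    public void Process(Crawler crawler, PropertyBag propertyBag)
    {
        Console.WriteLine(propertyBag.Step.Uri);
    }
}

public static class Program
{
    public static void Main()
    {
        using (var crawler = new Crawler(
            new Uri("http://example.com"),   // start page (placeholder URL)
            new HtmlDocumentProcessor(),     // parses the HTML and extracts links
            new LinkDumperStep())            // our own step from above
        {
            MaximumCrawlDepth = 2,           // follow links two levels deep
        })
        {
            crawler.Crawl();                 // blocks until the crawl finishes
        }
    }
}
```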
Upvotes: 3
Reputation: 2646
I recommend using http://watin.org/. It is much simpler than wget :-)
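For instance, listing all the links on a page with WatiN could look roughly like this; it is a sketch based on the WatiN 2.x object model (the IE class and the Links collection are assumptions on my part, so verify against the docs for your version):

```csharp
using System;
using WatiN.Core;

class LinkLister
{
    [STAThread] // WatiN drives Internet Explorer via COM, which requires an STA thread
    static void Main()
    {
        using (var browser = new IE("http://example.com")) // placeholder URL
        {
            // Enumerate every anchor element on the page
            foreach (Link link in browser.Links)
            {
                Console.WriteLine("{0} -> {1}", link.Text, link.Url);
            }
        }
    }
}
```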
Upvotes: 1
Reputation: 6441
Have a look at WGet. It is an incredibly powerful tool for mining the content of a single page or an entire website. Its options let you dictate how many levels of links to follow, what to do with static resources such as images, how to handle relative links, and so on. It also does a very good job of mining pages that are generated dynamically, such as those served by CGI or ASP.
It's been around for many years in the 'nix world but executables compiled for Windows are readily available.
You would need to kick it off from .NET using Process.Start, but you could then pipe the results into multiple files (mimicking the original website structure), into a single file, or into memory by capturing standard output. From there you can do subsequent analysis, such as extracting the HREF attributes of anchor elements (if it is only links you are interested in) or grabbing the sort of table data evident in the link in your question.
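A minimal sketch of that approach, assuming wget(.exe) is on the PATH; Process.Start and Regex are standard .NET, and the regex is deliberately naive, just enough to pull href values out of the captured page:

```csharp
using System;
using System.Diagnostics;
using System.Text.RegularExpressions;

class WgetLinkExtractor
{
    static void Main()
    {
        // Run wget quietly (-q) and write the fetched page to standard output (-O -)
        var startInfo = new ProcessStartInfo
        {
            FileName = "wget",                          // assumes wget(.exe) is on the PATH
            Arguments = "-q -O - http://example.com",   // placeholder URL
            RedirectStandardOutput = true,
            UseShellExecute = false,
            CreateNoWindow = true,
        };

        string html;
        using (var process = Process.Start(startInfo))
        {
            html = process.StandardOutput.ReadToEnd(); // capture the page into memory
            process.WaitForExit();
        }

        // Deliberately simple href extraction; a real HTML parser would be more robust.
        foreach (Match m in Regex.Matches(html, "href\\s*=\\s*[\"']([^\"']+)[\"']",
                                          RegexOptions.IgnoreCase))
        {
            Console.WriteLine(m.Groups[1].Value);
        }
    }
}
```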
I realise this is not a 'pure' .NET solution, but the power WGET offers more than compensates for this, in my opinion. I have used it myself in the past, in this way, for exactly the sort of thing I think you are trying to do.
Upvotes: 3