Nagu

Reputation: 5114

How to read the content of a website?

I want to read the content of a website and store it in a file using C# and ASP.NET. I know a page can be read with HttpWebRequest, but is it also possible to read the data behind all the available links?

Example: suppose I want to read http://www.msn.com. I can give the URL directly and read the home page data; that is no problem. But the msn.com home page contains many links, and I want to read the content of those pages as well. Is that possible?

Can somebody give me a start on how to do this?

Thanks in advance

Upvotes: 0

Views: 223

Answers (1)

Alex Reitbort

Reputation: 13706

  1. Define a queue of URLs.

  2. Add the main page URL to the queue.

  3. While the queue is not empty:

3.1 currentUrl = Dequeue()

3.2 Read the current URL.

3.3 Extract all URLs from the current page using a regexp.

3.4 Add all extracted URLs to the queue.

You will have to limit the URLs in the queue to some depth or to some domain, otherwise you will try to download the entire internet :)
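The steps above can be sketched in C# roughly as follows. This is a minimal, assumption-laden sketch, not production code: the starting URL, the page limit, and the output file naming are all made up for illustration, and the href-matching regexp is deliberately naive (a real crawler would use an HTML parser).

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Net;
using System.Text.RegularExpressions;

class CrawlerSketch
{
    static void Main()
    {
        var queue = new Queue<Uri>();                 // 1. define queue of urls
        var visited = new HashSet<string>();
        var start = new Uri("http://www.msn.com");    // 2. add main page url to queue
        queue.Enqueue(start);
        visited.Add(start.AbsoluteUri);

        int pagesLeft = 50;                           // crude page cap instead of real depth tracking
        using (var client = new WebClient())
        {
            while (queue.Count > 0 && pagesLeft-- > 0)        // 3. while queue is not empty
            {
                Uri currentUrl = queue.Dequeue();             // 3.1
                string html;
                try { html = client.DownloadString(currentUrl); }  // 3.2 read current url
                catch (WebException) { continue; }             // skip unreachable pages

                // store the page content in a file (naming scheme is arbitrary here)
                File.WriteAllText("page" + visited.Count + ".html", html);

                // 3.3 extract candidate urls from href attributes with a regexp
                var hrefs = Regex.Matches(html,
                    "href\\s*=\\s*[\"']([^\"']+)[\"']", RegexOptions.IgnoreCase);
                foreach (Match m in hrefs)
                {
                    Uri link;
                    if (!Uri.TryCreate(currentUrl, m.Groups[1].Value, out link))
                        continue;
                    // stay on the start domain so we don't crawl the entire internet
                    if (link.Host != start.Host)
                        continue;
                    if (visited.Add(link.AbsoluteUri))        // 3.4 enqueue only unseen urls
                        queue.Enqueue(link);
                }
            }
        }
    }
}
```

The `visited` set is what keeps the loop from requeuing pages that link back to each other; without it, two mutually linking pages would cycle forever.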

Upvotes: 1
