Alex Baranosky

Reputation: 50094

How can I use .NET to traverse the directory structure of a website?

Simple question.

What I want to do is access a particular directory then scan through the files looking for .html files and then download them.

I know I can use WebClient.DownloadString() to download the files, but how do I search through the directories?

Upvotes: 0

Views: 564

Answers (2)

zidane

Reputation: 632

You should parse the downloaded pages and search for <a> tags to extract the links. Repeat that process recursively until you have downloaded all of the pages you need.

Try a library called Html Agility Pack. This .NET library has killer features; in the project's own words, it

is an agile HTML parser that builds a read/write DOM and supports plain XPATH or XSLT.

After that it is easy to work with the document, and very easy to extract any information using XPath.
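For example, here is a minimal sketch of that approach, assuming the Html Agility Pack package is installed. The URL is a made-up placeholder; substitute the directory page you actually want to scan:

    using System;
    using System.Net;
    using HtmlAgilityPack;

    class HtmlDownloader
    {
        static void Main()
        {
            // Hypothetical directory page to scan; substitute your own URL.
            const string url = "http://example.com/the/directory/";

            HtmlDocument doc = new HtmlWeb().Load(url);

            // XPath: every <a> whose href ends in ".html".
            var links = doc.DocumentNode.SelectNodes(
                "//a[substring(@href, string-length(@href) - 4) = '.html']");
            if (links == null) return; // no matching anchors on this page

            using (var client = new WebClient())
            {
                foreach (var link in links)
                {
                    string href = link.GetAttributeValue("href", "");
                    // Resolve relative hrefs against the page URL.
                    string absolute = new Uri(new Uri(url), href).AbsoluteUri;
                    string html = client.DownloadString(absolute);
                    // ... save or process the downloaded html here ...
                }
            }
        }
    }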

Upvotes: 1

Jeremy Huiskamp

Reputation: 5294

HTTP does not have directory listing/traversal as part of the spec. The best you can do is hope that the default page for a directory is a listing, and then you will have to parse it looking for links to files in the same directory. There is no standard format for the listing, but it shouldn't be too hard to pull out the href attributes of all <a> tags and then check them for the following conditions (see the sketch after this list):

  • no slashes, e.g. "file.html"
  • full path to the same directory, e.g. "/the/directory/file.html", as long as you are looking at "/the/directory"
  • full URL to the same directory on the same server, e.g. "http://the.server/the/directory/file.html"
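In code, those three checks boil down to resolving each href against the directory URL and comparing host and path. A rough sketch (the helper name is mine, not anything standard):

    using System;

    static class LinkFilter
    {
        // True if href points at an .html file inside directoryUri.
        // Covers all three cases above (bare file names, absolute paths,
        // full URLs) by resolving the href against the directory URL.
        public static bool IsHtmlFileInDirectory(string href, Uri directoryUri)
        {
            if (!Uri.TryCreate(directoryUri, href, out Uri resolved))
                return false; // not a usable URL at all

            return resolved.Host == directoryUri.Host
                && resolved.AbsolutePath.StartsWith(directoryUri.AbsolutePath)
                && resolved.AbsolutePath.EndsWith(".html");
        }
    }

Letting the Uri class do the resolution is less brittle than string comparisons on the raw href, since it normalizes the three forms into one absolute URL before you compare.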

If the web server isn't giving you directory listings, you can always fall back to a full-blown web spider approach (parse all the links in a page, visit every one that is on the same server, parse those too, and so on, building up your own tree structure), but many websites don't lend themselves to this.
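A sketch of that spider approach, again leaning on the Html Agility Pack mentioned in the other answer. This is illustrative, not a hardened crawler; a real one would want throttling, depth limits, and robots.txt handling:

    using System;
    using System.Collections.Generic;
    using System.Net;
    using HtmlAgilityPack;

    static class Spider
    {
        // Breadth-first crawl of one host, yielding every .html URL found.
        public static IEnumerable<string> CrawlForHtml(string startUrl)
        {
            var start = new Uri(startUrl);
            var seen = new HashSet<string> { start.AbsoluteUri };
            var queue = new Queue<Uri>();
            queue.Enqueue(start);
            var web = new HtmlWeb();

            while (queue.Count > 0)
            {
                Uri current = queue.Dequeue();

                HtmlDocument doc;
                try { doc = web.Load(current.AbsoluteUri); }
                catch (WebException) { continue; } // skip unreachable pages

                var anchors = doc.DocumentNode.SelectNodes("//a[@href]");
                if (anchors == null) continue;

                foreach (var a in anchors)
                {
                    string href = a.GetAttributeValue("href", "");
                    if (!Uri.TryCreate(current, href, out Uri next))
                        continue;
                    // Stay on the same server; HashSet.Add is false on repeats.
                    if (next.Host != start.Host || !seen.Add(next.AbsoluteUri))
                        continue;
                    if (next.AbsolutePath.EndsWith(".html"))
                        yield return next.AbsoluteUri;
                    queue.Enqueue(next);
                }
            }
        }
    }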

Upvotes: 1
