Alex Baranosky

Reputation: 50094

How can I use .NET to traverse the directory structure of a website?

Simple question.

What I want to do is access a particular directory then scan through the files looking for .html files and then download them.

I know I can use WebClient.DownloadString() to download the files, but how do I search through the directories?

Upvotes: 0

Views: 564

Answers (2)

zidane

Reputation: 632

You should parse the downloaded pages and search for <a> tags to extract the links. Repeat that process recursively until you have downloaded all of the pages you need.

Try a library called Html Agility Pack. This .NET library has killer features; in the project's own words, it

is an agile HTML parser that builds a read/write DOM and supports plain XPATH or XSLT.

After that it is easy to work with the document, and very easy to extract any information using XPath.
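For example, here is a minimal sketch of that approach, assuming the Html Agility Pack package is installed. The URL is a made-up placeholder; substitute the directory page you actually want to scan:

    using System;
    using System.Net;
    using HtmlAgilityPack;

    class HtmlDownloader
    {
        static void Main()
        {
            // Hypothetical directory page to scan; substitute your own URL.
            const string url = "http://example.com/the/directory/";

            HtmlDocument doc = new HtmlWeb().Load(url);

            // XPath: every <a> whose href ends in ".html".
            var links = doc.DocumentNode.SelectNodes(
                "//a[substring(@href, string-length(@href) - 4) = '.html']");
            if (links == null) return; // no matching anchors on this page

            using (var client = new WebClient())
            {
                foreach (var link in links)
                {
                    string href = link.GetAttributeValue("href", "");
                    // Resolve relative hrefs against the page URL.
                    string absolute = new Uri(new Uri(url), href).AbsoluteUri;
                    string html = client.DownloadString(absolute);
                    // ... save or process the downloaded html here ...
                }
            }
        }
    }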

Upvotes: 1

Jeremy Huiskamp

Reputation: 5294

HTTP does not have directory listing/traversal as part of the spec. The best you can do is hope that the default page for a directory is a listing, and then you will have to parse it looking for links to files in the same directory. There is no standard format for the listing, but it shouldn't be too hard to pull out the href attributes of all <a> tags and then check them for the following conditions (see the sketch after this list):

  • no slashes, e.g. "file.html"
  • full path to the same directory, e.g. "/the/directory/file.html", as long as you are looking at "/the/directory"
  • full URL to the same directory on the same server, e.g. "http://the.server/the/directory/file.html"
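In code, those three checks boil down to resolving each href against the directory URL and comparing host and path. A rough sketch (the helper name is mine, not anything standard):

    using System;

    static class LinkFilter
    {
        // True if href points at an .html file inside directoryUri.
        // Covers all three cases above (bare file names, absolute paths,
        // full URLs) by resolving the href against the directory URL.
        public static bool IsHtmlFileInDirectory(string href, Uri directoryUri)
        {
            if (!Uri.TryCreate(directoryUri, href, out Uri resolved))
                return false; // not a usable URL at all

            return resolved.Host == directoryUri.Host
                && resolved.AbsolutePath.StartsWith(directoryUri.AbsolutePath)
                && resolved.AbsolutePath.EndsWith(".html");
        }
    }

Letting the Uri class do the resolution is less brittle than string comparisons on the raw href, since it normalizes the three forms into one absolute URL before you compare.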

If the web server isn't giving you directory listings, you can always fall back to a full-blown web spider approach (parse all the links in a page, visit every one that is on the same server, parse those too, and so on, building up your own tree structure), but many websites don't lend themselves to this.
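A sketch of that spider approach, again leaning on the Html Agility Pack mentioned in the other answer. This is illustrative, not a hardened crawler; a real one would want throttling, depth limits, and robots.txt handling:

    using System;
    using System.Collections.Generic;
    using System.Net;
    using HtmlAgilityPack;

    static class Spider
    {
        // Breadth-first crawl of one host, yielding every .html URL found.
        public static IEnumerable<string> CrawlForHtml(string startUrl)
        {
            var start = new Uri(startUrl);
            var seen = new HashSet<string> { start.AbsoluteUri };
            var queue = new Queue<Uri>();
            queue.Enqueue(start);
            var web = new HtmlWeb();

            while (queue.Count > 0)
            {
                Uri current = queue.Dequeue();

                HtmlDocument doc;
                try { doc = web.Load(current.AbsoluteUri); }
                catch (WebException) { continue; } // skip unreachable pages

                var anchors = doc.DocumentNode.SelectNodes("//a[@href]");
                if (anchors == null) continue;

                foreach (var a in anchors)
                {
                    string href = a.GetAttributeValue("href", "");
                    if (!Uri.TryCreate(current, href, out Uri next))
                        continue;
                    // Stay on the same server; HashSet.Add is false on repeats.
                    if (next.Host != start.Host || !seen.Add(next.AbsoluteUri))
                        continue;
                    if (next.AbsolutePath.EndsWith(".html"))
                        yield return next.AbsoluteUri;
                    queue.Enqueue(next);
                }
            }
        }
    }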

Upvotes: 1
