Reputation: 50094
Simple question.
How can I use .NET to traverse the directory structure of a website?
What I want to do is access a particular directory, scan through it looking for .html files, and then download them.
I know I can use WebClient.DownloadString() to fetch the files, but how can I do the searching through the directories?
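For reference, the download step I have in mind looks roughly like this (just a sketch; the URL is a made-up placeholder):

    using System;
    using System.IO;
    using System.Net;

    class Downloader
    {
        static void Main()
        {
            // Placeholder URL; the real site and directory are whatever I end up scanning.
            string fileUrl = "http://example.com/docs/page.html";

            using (var client = new WebClient())
            {
                // DownloadString returns the response body as a string.
                string html = client.DownloadString(fileUrl);
                File.WriteAllText(Path.GetFileName(new Uri(fileUrl).AbsolutePath), html);
            }
        }
    }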
Upvotes: 0
Views: 564
Reputation: 632
You should parse the downloaded pages and search for <a>
tags to extract links, then recursively repeat that process until you have downloaded all of the pages you need.
Try a library called Html Agility Pack. This .NET library has some killer features: it
is an agile HTML parser that builds a read/write DOM and supports plain XPath or XSLT.
That makes it easy to work with the document and very easy to extract any information using XPath.
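For example, a minimal sketch of that approach (assuming the Html Agility Pack package is referenced; the URL and the .html filter are placeholders taken from the question):

    using System;
    using System.Net;
    using HtmlAgilityPack;

    class LinkExtractor
    {
        static void Main()
        {
            // Placeholder starting page; in practice this is the directory page you are scanning.
            string pageUrl = "http://example.com/docs/";

            string html;
            using (var client = new WebClient())
            {
                html = client.DownloadString(pageUrl);
            }

            // Parse the page and pull the href attribute of every <a> tag via XPath.
            var doc = new HtmlDocument();
            doc.LoadHtml(html);

            var anchors = doc.DocumentNode.SelectNodes("//a[@href]");
            if (anchors == null) return; // SelectNodes returns null when nothing matches

            foreach (var anchor in anchors)
            {
                string href = anchor.GetAttributeValue("href", "");
                if (href.EndsWith(".html", StringComparison.OrdinalIgnoreCase))
                {
                    // Resolve relative links against the page URL before downloading them.
                    Console.WriteLine(new Uri(new Uri(pageUrl), href));
                }
            }
        }
    }

From each link you find, download the page with WebClient.DownloadString() and repeat.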
Upvotes: 1
Reputation: 5294
HTTP does not have directory listing/traversal as part of the spec. The best you can do is hope that the default page for a directory is a listing, and then parse it looking for links to files in the same directory. There is no standard format for such listings, but it shouldn't be too hard to pull out the href attributes of all <a>
tags and then check whether each one is something you actually want (in your case, a link within the same directory that ends in .html).
If the webserver isn't giving you directory listings, you can always go for a full-blown web spider approach (parse all the links in a page, visit every link that stays on the same server, parse those pages, and so on, building up your own tree structure as you go), but many websites don't lend themselves to doing this easily.
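A rough sketch of that spider idea (the URL is a placeholder, and the regex is only for illustration; a real crawler should use a proper HTML parser and check content types):

    using System;
    using System.Collections.Generic;
    using System.Net;
    using System.Text.RegularExpressions;

    class Spider
    {
        // Crude href extraction, for illustration only.
        static readonly Regex HrefPattern =
            new Regex("href\\s*=\\s*[\"']([^\"']+)[\"']", RegexOptions.IgnoreCase);

        static void Main()
        {
            // Placeholder starting point; only links on this host are followed.
            var start = new Uri("http://example.com/");
            var visited = new HashSet<string>();
            var queue = new Queue<Uri>();
            queue.Enqueue(start);

            using (var client = new WebClient())
            {
                while (queue.Count > 0)
                {
                    Uri current = queue.Dequeue();
                    if (!visited.Add(current.AbsoluteUri)) continue;

                    string html;
                    try { html = client.DownloadString(current); }
                    catch (WebException) { continue; } // skip anything that fails to download

                    if (current.AbsolutePath.EndsWith(".html", StringComparison.OrdinalIgnoreCase))
                    {
                        Console.WriteLine("Found page: " + current); // candidate to save
                    }

                    // Note: this also fetches non-HTML resources; a real crawler
                    // would check the Content-Type before parsing.
                    foreach (Match m in HrefPattern.Matches(html))
                    {
                        if (Uri.TryCreate(current, m.Groups[1].Value, out Uri link) &&
                            link.Host == start.Host) // stay on the same server
                        {
                            queue.Enqueue(link);
                        }
                    }
                }
            }
        }
    }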
Upvotes: 1