Reputation: 5
Before asking my question (which is basically what the title says), I want to provide some background so you have a better picture of my situation.
I am writing a small application in Java, mainly for academic purposes, but also with a very specific task in mind. What the application does is build a URL hierarchy starting from a base URL, and then let the user organize the links and perform some actions on them.
Imagine the following URLs:
http://www.example.com
http://www.example.com/sub001
http://www.example.com/sub002
http://www.example.com/sub002/ultrasub
I would like my program to retrieve this hierarchy when provided with the base URL http://www.example.com (or http://www.example.com/).
In my code I have a class capable of encoding URLs, and I have already thought of a way to validate them; I just couldn't find a way to discover the URL hierarchy beneath the base URL.
Is there a direct way of doing it, or do I just have to download the files from the base URL and start building the hierarchy from the relative and absolute links present in the file?
I am not asking for specific code, just a (somewhat) complete explanation of what way I could take to do it, with maybe some skeleton code to guide me.
Also, I am storing the URLs in a TreeMap<URL,Boolean> structure, in which the Boolean states whether the URL has already been analyzed. I chose this structure after a quick peek at the Java 7 API specification, but do you suggest any structure better suited for this specific purpose?
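In case it is useful, this is roughly what I mean (just a sketch; the class and method names are only illustrative). One caveat I ran into: java.net.URL does not implement Comparable, so the TreeMap needs an explicit Comparator:

import java.net.URL;
import java.util.Comparator;
import java.util.TreeMap;

public class UrlRegistry {
    // java.net.URL has no natural ordering, so the TreeMap needs a
    // Comparator; comparing external forms also avoids the DNS lookups
    // that URL.equals()/hashCode() can trigger.
    private final TreeMap<URL, Boolean> urls = new TreeMap<>(new Comparator<URL>() {
        @Override
        public int compare(URL a, URL b) {
            return a.toExternalForm().compareTo(b.toExternalForm());
        }
    });

    public void discovered(URL url) {
        if (!urls.containsKey(url)) {
            urls.put(url, Boolean.FALSE); // seen, but not yet analyzed
        }
    }

    public void analyzed(URL url) {
        urls.put(url, Boolean.TRUE);
    }
}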
Thanks in advance :)
Upvotes: 0
Views: 637
Reputation: 17697
There is no way in the HTTP protocol to request all the URLs that are 'under' a given URL. You are out of luck.
Some protocols (ftp://, for example) do have explicit mechanisms for this.
Some HTTP servers will return an index page if you request a 'directory', but this practice is not recommended and not many servers do it.
Bottom line is that you have to follow links in order to determine what the server hierarchy is, and even then you may not discover links to all areas of the hierarchy.
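To make that concrete, here is a rough skeleton of the link-following approach. I am using the jsoup HTML parser purely for convenience (that choice is mine, not something from your question; plain java.net plus your own HTML parsing would work just as well):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.io.IOException;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.Set;

public class LinkCrawler {
    public static Set<String> crawl(String baseUrl) throws IOException {
        Set<String> seen = new LinkedHashSet<>();
        Deque<String> queue = new ArrayDeque<>();
        seen.add(baseUrl);
        queue.add(baseUrl);
        while (!queue.isEmpty()) {
            String url = queue.poll();
            Document doc = Jsoup.connect(url).get(); // fetch and parse the page
            for (Element link : doc.select("a[href]")) {
                String abs = link.attr("abs:href"); // resolve relative links
                if (abs.startsWith(baseUrl) && seen.add(abs)) {
                    queue.add(abs); // only descend into URLs under the base
                }
            }
        }
        return seen;
    }
}

The LinkedHashSet keeps the discovery order, which makes it easy to rebuild the tree afterwards; a real crawler would also need error handling for non-HTML responses, politeness delays, and a depth limit.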
EDIT: I should add that, as a well-behaved netizen, you should obey the robots.txt file on any server you access.
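There is no robots.txt support built into the JDK, but even a naive reader is better than nothing. A deliberately simplified sketch (it only honors User-agent: * records and ignores wildcards, Allow lines, and so on):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;

public class NaiveRobots {
    // Very naive: collects Disallow: prefixes from User-agent: * records only.
    public static List<String> disallowedPrefixes(String host) throws IOException {
        List<String> prefixes = new ArrayList<>();
        URL robots = new URL("http://" + host + "/robots.txt");
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(robots.openStream()))) {
            boolean anyAgent = false;
            String line;
            while ((line = in.readLine()) != null) {
                line = line.trim();
                if (line.regionMatches(true, 0, "User-agent:", 0, 11)) {
                    anyAgent = line.substring(11).trim().equals("*");
                } else if (anyAgent && line.regionMatches(true, 0, "Disallow:", 0, 9)) {
                    prefixes.add(line.substring(9).trim());
                }
            }
        }
        return prefixes;
    }
}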
EDIT2: (after comment on FTP mechanism)
The FTP protocol has many commands: see this wiki list. One of these commands is NLST, which "Returns a list of file names in a specified directory."
The URL specification (RFC 1738) makes special provision in the URL format for FTP URLs, and in section 3.2.2:
The url-path of a FTP URL has the following syntax:
<cwd1>/<cwd2>/.../<cwdN>/<name>;type=<typecode>
....
If the typecode is "d", perform a NLST (name list) command with <name> as the argument, and interpret the results as a file directory listing.
I can see the effect when I try this from the command line (not from a browser):
rolfl@home ~ $ curl 'ftp://sunsite.unc.edu/README'
Welcome to ftp.ibiblio.org, the public ftp server of ibiblio.org. We
hope you find what you're looking for.
If you have any problems or questions, please see
http://www.ibiblio.org/help/
Thanks!
and with type=d I get:
rolfl@home ~ $ curl 'ftp://sunsite.unc.edu/README;type=d'
HEADER.images
incoming
HEADER.html
pub
unc
README
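For completeness, the same listing can be fetched from Java, since the JDK ships an ftp:// protocol handler. As far as I can tell it honors the ;type=d suffix, but verify against your JDK; a sketch:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;

public class FtpDirListing {
    public static void main(String[] args) throws IOException {
        // ;type=d requests a NLST-style directory listing (RFC 1738, 3.2.2)
        URL dir = new URL("ftp://sunsite.unc.edu/;type=d");
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(dir.openStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line); // one file name per line
            }
        }
    }
}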
Upvotes: 1