Reputation: 59365
Given a URL, how can you tell if the referenced file is an HTML file?
Obviously, it's an HTML file if it ends in .html or /, but then there are .jsp files, too, so I'm wondering what other extensions may be out there for HTML.
Alternatively, if this information can be easily obtained from a URL object in Java, that would be sufficient for my purposes.
Upvotes: 3
Views: 5183
Reputation: 30448
You cannot tell from the URL alone. Consider the following URLs:
All of them return HTML content. The only sure way is to ask the server for the resource and check the Content-Type header. It is better to send a HEAD request to the server instead of GET or POST: it returns just the headers, without the content.
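// Needs: import java.net.HttpURLConnection; import java.net.URL;
URL url = ...
HttpURLConnection urlc = (HttpURLConnection) url.openConnection();
urlc.setAllowUserInteraction(false);
urlc.setDoInput(true);
urlc.setDoOutput(false);
urlc.setUseCaches(true);
urlc.setRequestMethod("HEAD"); // headers only, no body
urlc.connect();
String mime = urlc.getContentType(); // null if the server sent no Content-Type
// The header may carry parameters, e.g. "text/html; charset=UTF-8"
if (mime != null && mime.toLowerCase().startsWith("text/html")) {
    // do your stuff
}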
Upvotes: 10
Reputation: 5265
You can't. Sometimes a URL ends with the .html extension but does not actually point to an HTML file. For example, I normally use the .html extension for Spring actions, so the URL looks like an HTML file, but it is not. So in practice you can't determine it from the URL.
Upvotes: 0
Reputation: 38379
Fundamentally, a URL is merely an address. There are plenty of useful, meaningful conventions that you can use to decipher what it might contain, but when it comes down to it, a web server is free to return any type of thing it wants for a given URL. Not even querying the server, asking for what comes back, and examining it is a 100% surefire way of knowing what sort of file it is. The server could easily change what sort of file it points to based on the request, or the time of day, or the whims of its owner.
There are some good basic guidelines that will work most of the time, but I hesitate to even mention them because they're absolutely not reliable.
There is some good news, though. If you actually request the data from the server, it will, just as some other answers point out, tell you precisely what sort of thing it is providing you with (for this particular exchange). It'll give you a MIME type in the header field named "Content-Type". If it's text/html, then you have yourself an HTML document (not an image, not an XHTML document, HTML).
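To make the Content-Type check concrete, here is a minimal sketch of comparing a header value against text/html. The class and method names (ContentTypeCheck, isHtml) are made up for illustration and are not from any answer above. It assumes you already have the raw header string, for example from HttpURLConnection.getContentType(), and it ignores any parameters such as charset.

```java
import java.util.Locale;

// Hypothetical helper, named here only for illustration.
final class ContentTypeCheck {
    private ContentTypeCheck() {}

    /** Returns true if the Content-Type header value names the text/html media type. */
    static boolean isHtml(String contentType) {
        if (contentType == null) {
            return false; // the server sent no Content-Type header
        }
        // Drop parameters such as "; charset=UTF-8".
        int semicolon = contentType.indexOf(';');
        String mediaType = semicolon >= 0
                ? contentType.substring(0, semicolon)
                : contentType;
        // Media type names are case-insensitive.
        return mediaType.trim().toLowerCase(Locale.ROOT).equals("text/html");
    }

    public static void main(String[] args) {
        System.out.println(isHtml("text/html; charset=UTF-8")); // true
        System.out.println(isHtml("application/xhtml+xml"));    // false
        System.out.println(isHtml(null));                       // false
    }
}
```

An exact comparison on the media type, rather than a plain equals on the whole header, keeps a value like "text/html; charset=UTF-8" from being rejected.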
Upvotes: 4
Reputation: 41232
HTML stands for HyperText Markup Language, which is a standard. A URL ending in *.html usually means a static HTML page, while the others (*.jsp, *.php, *.asp, etc.) generate HTML dynamically. So you cannot find out from the URL. You can try looking at the Content-Type, but that way you will still miss some pages.
Upvotes: 1
Reputation: 54605
Put simply: you can't.
There are REST-style URLs like
http://yourserver.com/service/givemehtml/
which serve you HTML.
Upvotes: 2
Reputation: 12695
You cannot. There is nothing wrong with serving up HTML files with URLs that end in .jpeg, .gif, or even .mp3. The only way to know is to fetch the URL and check the Content-Type header to see if it is text/html (but even that isn't 100% accurate because of poorly configured web servers).
Upvotes: 7
Reputation: 14195
You can't, but you can ask the server for the headers and check the content type to see if it is text/html.
Upvotes: 20