Eric Wilson

Reputation: 59365

How do you determine if a file is html from the URL?

Given a URL, how can you tell if the referenced file is an HTML file?

Obviously, it's an HTML file if it ends in .html or /, but then there are .jsp files, too, so I'm wondering what other extensions may be out there for HTML.

Alternatively, if this information can be easily gained from a URL object in Java, that would be sufficient for my purposes.

Upvotes: 3

Views: 5183

Answers (7)

David Rabinowitz

Reputation: 30448

You cannot tell just from the URL. Think of the following URLs:

All of them return HTML content. The only sure way is to ask the server for the resource and check the Content-Type header. It is better to send a HEAD request to the server instead of GET or POST: it will give you just the headers, without the content.

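  import java.net.HttpURLConnection;
  import java.net.URL;

  URL url = ...
  HttpURLConnection urlc = (HttpURLConnection) url.openConnection();
  urlc.setAllowUserInteraction(false);
  urlc.setDoInput(true);
  urlc.setDoOutput(false);
  urlc.setUseCaches(true);
  // HEAD asks for the headers only, not the body
  urlc.setRequestMethod("HEAD");
  urlc.connect();
  // May be null if the server sends no Content-Type, and may carry
  // parameters, e.g. "text/html; charset=UTF-8"
  String mime = urlc.getContentType();
  if (mime != null && mime.toLowerCase().startsWith("text/html")) {
    // do your stuff
  }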

Upvotes: 10

Silent Warrior

Reputation: 5265

You can't. Sometimes a URL ends with the .html extension but does not actually point to an HTML file. For example, in Spring actions I normally use the .html extension, so the URL looks like it refers to an HTML file, but it does not. So in practice you can't determine it.

Upvotes: 0

Brandon Yarbrough

Reputation: 38379

Fundamentally, a URL is merely an address. There are plenty of useful, meaningful conventions that you can use to decipher what it might contain, but when it comes down to it, a web server is free to return any type of thing it wants for a given URL. Not even querying the server, asking for what comes back, and examining it is a 100% surefire way of knowing what sort of file it is. The server could easily change what sort of file it points to based on the request, or the time of day, or the whims of its owner.

There are some good basic guidelines that will work most of the time, but I hesitate to even mention them because they're absolutely not reliable.

There is some good news, though. If you actually request the data from the server, it will, just as some other answers point out, tell you precisely what sort of thing it is providing you with (for this particular exchange). It'll give you a MIME-Type in the field named "Content-Type". If it's text/html, then you have yourself an html document (not an image, not an xhtml document, HTML).
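To make the Content-Type check concrete, here is a minimal sketch in Java. The class and method names (ContentTypes, mediaType, isHtml) are made up for illustration; they are not part of any library. It assumes you already have the raw header value as a string, and it strips parameters such as "; charset=UTF-8" before comparing, since servers often send them.

```java
import java.util.Locale;

// Hypothetical helper, not a library class.
public final class ContentTypes {
    private ContentTypes() {}

    /** Returns the bare media type (e.g. "text/html") from a Content-Type value, or null. */
    public static String mediaType(String contentType) {
        if (contentType == null) {
            return null; // the server sent no Content-Type header
        }
        // Drop parameters such as "; charset=UTF-8"
        int semi = contentType.indexOf(';');
        String bare = semi >= 0 ? contentType.substring(0, semi) : contentType;
        // Media type names are case-insensitive, so normalize before comparing
        bare = bare.trim().toLowerCase(Locale.ROOT);
        return bare.isEmpty() ? null : bare;
    }

    /** True only for text/html; application/xhtml+xml is deliberately not counted. */
    public static boolean isHtml(String contentType) {
        return "text/html".equals(mediaType(contentType));
    }
}
```

You would pass it whatever the connection reports, for example ContentTypes.isHtml(urlc.getContentType()) on an HttpURLConnection named urlc.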

Upvotes: 4

Artem Barger

Reputation: 41232

HTML stands for HyperText Markup Language, which means HTML is a standard. A URL ending in *.html usually means there is a static HTML page; all the others (*.jsp, *.php, *.asp, etc.) generate HTML dynamically. So you cannot find out from the URL. You can try looking at the Content-Type, but that way you will still miss some pages.

Upvotes: 1

jitter

Reputation: 54605

Put simply: you can't.

There are REST-style URLs like

http://yourserver.com/service/givemehtml/

which serve you html.

Upvotes: 2

caskey

Reputation: 12695

You cannot. There is nothing wrong with serving up HTML files with URLs that end in .jpeg, .gif, or even .mp3. The only way to know is to fetch the URL and look at the Content-Type header to see if it is text/html (and even that isn't 100% accurate, because of poorly configured web servers).
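If you want a fallback for servers that send a wrong or missing Content-Type, one option is to peek at the first bytes of the body. The following is only a rough sketch: HtmlSniffer and looksLikeHtml are hypothetical names, and the check is a simple heuristic of my own, not the algorithm browsers use. It will miss documents that start with a comment or an XML declaration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical helper, not a library class.
public final class HtmlSniffer {
    private HtmlSniffer() {}

    /** Heuristic: does the start of the response body look like an HTML document? */
    public static boolean looksLikeHtml(byte[] prefix) {
        // ISO-8859-1 maps every byte to one char; the markup we look for is plain ASCII
        String head = new String(prefix, StandardCharsets.ISO_8859_1)
                .toLowerCase(Locale.ROOT);
        // Skip a UTF-8 byte order mark (bytes EF BB BF) if present
        if (head.startsWith("\u00ef\u00bb\u00bf")) {
            head = head.substring(3);
        }
        // Ignore leading whitespace before the first tag
        head = head.trim();
        return head.startsWith("<!doctype html") || head.startsWith("<html");
    }
}
```

Read a small prefix of the response (a few hundred bytes is plenty for this check) and pass it in. Treat a positive result as a hint, not proof.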

Upvotes: 7

Ben Hughes

Reputation: 14195

You can't. But you can ask the server for the headers and check the Content-Type to see if it is text/html.

Upvotes: 20
