Reputation: 1208
I have written some simple code to get the content type
of a given URL. To make the processing faster, I changed the request method to HEAD:
// Added a random puppy face picture here
// On entering this query in browser (or Poster<mozilla> or Postman<chrome>), the
// content type is shown as image/jpeg
URL url = new URL("http://www.bubblews.com/assets/images/news/521013543_1385596410.jpg");
HttpURLConnection connection = (HttpURLConnection) url.openConnection();
connection.setRequestMethod("HEAD");
connection.connect();
String contentType = connection.getContentType();
System.out.println(contentType);
if (!contentType.contains("text/html")) {
    System.out.println("NOT TEXT/HTML");
    // Do something
}
I want to do something when the content type is not text/html, but when I set the request method to HEAD, the content type is reported as text/html. If I send the same HEAD request using Poster or Postman, I see the content type as image/jpeg.
So what makes the content type change when the request comes from this Java code? Can someone please point out any mistake I may have made?
Note: I used this post as reference
Upvotes: 2
Views: 3750
Reputation: 13007
You should probably add an Accept header and/or a User-Agent header.
Many web servers deliver different content depending on the headers set by the client (e.g. a web browser, Java's HttpURLConnection, curl, ...). This is especially true for Accept, Accept-Encoding, Accept-Language, User-Agent, Cookie and Referer.
As an example, a web server might refuse to deliver an image if the Referer header does not point to one of its own pages.
In your case, the web server apparently doesn't deliver images when the request looks like it comes from a robot crawling the site. So if you make your request look as if it were coming from a web browser, the server might deliver the image.
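As a rough sketch of that suggestion, the HEAD request from the question could be sent with browser-like headers. The class name `HeadContentType`, the helper `prepareHead`, and both header values are made up for illustration; which headers this particular server actually checks is an assumption, so you may need to experiment.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

public class HeadContentType {

    // Hypothetical header values, chosen only to resemble a browser request.
    static final String USER_AGENT =
            "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36";
    static final String ACCEPT = "image/*,*/*;q=0.8";

    // Builds a HEAD request with browser-like headers.
    // Nothing is sent yet; the request goes out on connect() or on the
    // first call that needs the response, such as getContentType().
    static HttpURLConnection prepareHead(URL url) throws IOException {
        HttpURLConnection connection = (HttpURLConnection) url.openConnection();
        connection.setRequestMethod("HEAD");
        connection.setRequestProperty("User-Agent", USER_AGENT);
        connection.setRequestProperty("Accept", ACCEPT);
        return connection;
    }

    public static void main(String[] args) throws IOException {
        URL url = new URL("http://www.bubblews.com/assets/images/news/521013543_1385596410.jpg");
        HttpURLConnection connection = prepareHead(url);
        try {
            // getContentType() returns null if the header is missing or the
            // request failed, so check for null before calling contains().
            String contentType = connection.getContentType();
            System.out.println(contentType);
            if (contentType == null || !contentType.contains("text/html")) {
                System.out.println("NOT TEXT/HTML");
                // Do something
            }
        } finally {
            connection.disconnect();
        }
    }
}
```

If the content type still comes back as text/html, comparing the request headers that Poster or Postman send with the ones set here is a reasonable next step.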
When crawling websites, you should respect robots.txt (because you are acting like a robot). So strictly speaking, you should be careful about faking the User-Agent when you send a lot of requests or build a big business on this. I don't know how big websites react to such behavior, especially when someone bypasses their business...
Please don't see this as a telling-off. I just wanted to point you to this, so you don't run into trouble. Maybe it's not a problem at all, YMMV.
Upvotes: 1