Sap

Reputation: 168

wget and HttpClient work but Jsoup doesn't work in Java

I am trying to fetch the HTML source of a web page from Java code using Jsoup. Below is the code I am using to fetch the page. The response I get is a 500 Internal Server Error.

  String encodedUrl = URIUtil.encodePathQuery(urlToFetch.trim(), "ISO-8859-1");
  Response res = Jsoup.connect(encodedUrl)
        .header("Accept-Language", "en")
        .userAgent(userAgent)
        .data(data)
        .maxBodySize(bodySize)
        .ignoreHttpErrors(true)
        .ignoreContentType(true)
        .timeout(10000)
        .execute();

However, when I fetch the same page with wget from the command line, it works. A simple HttpClient request from code also works.

// Create an instance of HttpClient.
HttpClient client = new HttpClient();

// Create a method instance.
GetMethod method = new GetMethod(url);

// Provide a custom retry handler if necessary
method.getParams().setParameter(HttpMethodParams.RETRY_HANDLER, 
        new DefaultHttpMethodRetryHandler(3, false));

try {
  // Execute the method.
  int statusCode = client.executeMethod(method);

  if (statusCode != HttpStatus.SC_OK) {
    System.err.println("Method failed: " + method.getStatusLine());
  }

  // Read the response body.
  byte[] responseBody = method.getResponseBody();

  // Deal with the response.
  // Use caution: ensure correct character encoding and is not binary data
  System.out.println(new String(responseBody));

} catch (HttpException e) {
  System.err.println("Fatal protocol violation: " + e.getMessage());
  e.printStackTrace();
} catch (IOException e) {
  System.err.println("Fatal transport error: " + e.getMessage());
  e.printStackTrace();
} finally {
  // Release the connection.
  method.releaseConnection();
}  

Is there anything I need to change in the parameters passed to Jsoup.connect() for it to work?

This does not happen for all URLs, however. It happens specifically for pages from this website:

http://xyo.net/iphone-app/instagram-RrkBUFE/

Upvotes: 1

Views: 736

Answers (1)

fonkap

Reputation: 2509

You need to send an Accept header.

Try this:

    String encodedUrl = "http://xyo.net/iphone-app/instagram-RrkBUFE/";

    Response res = Jsoup.connect(encodedUrl)
            .header("Accept-Language", "en")
            .ignoreHttpErrors(true)
            .ignoreContentType(true)
            .header("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8")
            .followRedirects(true)
            .timeout(10000)
            .method(Connection.Method.GET)
            .execute();


    System.out.println(res.parse());

It works.
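If you want to see which headers a Java client really sends, so you can compare them with what wget sends (wget sends Accept: */* by default), one option is to point the client at a small local server that echoes the request headers back. Below is a minimal sketch using only the JDK. HeaderEcho, startEchoServer and echoWithAccept are hypothetical names made up for this illustration, and HttpURLConnection merely stands in for whichever client you are debugging.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.Map;

// Hypothetical debugging helper; not part of Jsoup or HttpClient.
class HeaderEcho {

    // Starts a throwaway server on a free local port that replies with
    // the request headers it received, one header per line.
    static HttpServer startEchoServer() throws IOException {
        // Port 0 asks the OS for any free port.
        HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
        server.createContext("/", exchange -> {
            StringBuilder dump = new StringBuilder();
            for (Map.Entry<String, List<String>> h : exchange.getRequestHeaders().entrySet()) {
                dump.append(h.getKey()).append(": ")
                    .append(String.join(", ", h.getValue())).append('\n');
            }
            byte[] body = dump.toString().getBytes(StandardCharsets.UTF_8);
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(body);
            }
        });
        server.start();
        return server;
    }

    // Sends one GET to the echo server, setting Accept explicitly when a
    // value is given, and returns the header dump the server saw.
    static String echoWithAccept(String accept) throws IOException {
        HttpServer server = startEchoServer();
        try {
            URI uri = URI.create("http://127.0.0.1:" + server.getAddress().getPort() + "/");
            HttpURLConnection conn =
                    (HttpURLConnection) uri.toURL().openConnection(Proxy.NO_PROXY);
            if (accept != null) {
                conn.setRequestProperty("Accept", accept);
            }
            try (InputStream in = conn.getInputStream()) {
                return new String(in.readAllBytes(), StandardCharsets.UTF_8);
            }
        } finally {
            server.stop(0);
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.print(echoWithAccept(
                "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"));
    }
}
```

To inspect Jsoup or wget instead, call startEchoServer() on its own, print the port from server.getAddress().getPort(), and send the request to that local URL. Note that the JDK server normalizes header names (for example User-agent), so compare the values rather than the exact capitalization.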

Please also note that the site tries to set cookies. If later requests depend on them, you may need to pass the map returned by res.cookies() to the next request via .cookies(...).

Hope it will help.

Upvotes: 1
