Evya

Reputation: 2375

jsoup request returns wrong status code

I have a Twitter shortened URL (t.co) and I'm trying to use jsoup to send a request and parse its response. There should be three redirect hops before reaching the final URL. This is not the case when using jsoup, even after setting followRedirects to true.

My code:

import java.io.IOException;

import org.jsoup.Connection.Response;
import org.jsoup.Jsoup;

public class Main {
    public static void main(String[] args) {
        try {
            Response response = Jsoup.connect("https://t. co/sLMy6zi4Yw").followRedirects(true).execute(); // Space intentional to avoid SOF shortened errors
            System.out.println(response.statusCode()); // prints 200
        } catch (IOException e) {
            System.out.println(e.getMessage());
        }
    }
}

However, using Python's Requests library, I can get the right response:

import requests

response = requests.get('https://t. co/sLMy6zi4Yw', allow_redirects=False)
print(response.status_code)

301

I'm using jsoup version 1.11.2 and Requests version 2.18.4 with Python 3.5.2.

Anybody have any insight on the matter?

Upvotes: 2

Views: 750

Answers (1)

James W.

Reputation: 3055

To work around this special case, you can remove the User-Agent header, which jsoup sets by default (for some unknown/undocumented reason):

    Connection connection = Jsoup.connect(url).followRedirects(true);
    connection.request().removeHeader("User-Agent");

Let's examine the raw requests and see how the server behaves.

A request with a User-Agent (to simulate a browser) returns:

  • status code 200
  • a meta refresh, which instructs a web browser to automatically refresh the current page or frame after a given time interval; in this case 0 seconds, with the URL http://bit. ly/2n3VDpo
  • JavaScript code that replaces the location with the same URL (google "meta refresh is deprecated" / "drawbacks using meta refresh")

Curl example

curl --include --raw "https://t. co/sLMy6zi4Yw" --user-agent "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36"

Response


HTTP/1.1 200 OK
cache-control: private,max-age=300
content-length: 257
content-security-policy: referrer always;
content-type: text/html; charset=utf-8
referrer-policy: unsafe-url
server: tsa_b
strict-transport-security: max-age=0
vary: Origin
x-response-time: 20
x-xss-protection: 1; mode=block; report=https://twitter.com/i/xss_report

<head><meta name="referrer" content="always"><noscript><META http-equiv="refresh" content="0;URL=http://bit. ly/2n3VDpo"></noscript><title>http://bit. ly/2n3VDpo</title></head><script>window.opener = null;location.replace("http:\/\/bit. ly\/2n3VDpo")</script>
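If the User-Agent header has to stay, the 200 response above still carries the next hop in its meta refresh tag. Here is a minimal sketch of pulling that target out of the body using only the Java standard library. The class name `MetaRefresh`, the method `extractTarget`, and the example.com URL are illustrative assumptions, and the regex assumes `http-equiv` comes before `content`, as it does in the response above. With jsoup already on the classpath, selecting the meta element from the parsed document would likely be more robust than a regex.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MetaRefresh {
    // Matches <meta http-equiv="refresh" content="N;URL=target">, case-insensitively.
    // Assumption: http-equiv appears before content, as in the t.co response.
    private static final Pattern META_REFRESH = Pattern.compile(
            "<meta\\s+http-equiv=[\"']refresh[\"']\\s+content=[\"']\\s*\\d+\\s*;\\s*url=([^\"']+)[\"']",
            Pattern.CASE_INSENSITIVE);

    /** Returns the meta refresh target URL, or empty if the page has none. */
    static Optional<String> extractTarget(String html) {
        Matcher m = META_REFRESH.matcher(html);
        return m.find() ? Optional.of(m.group(1).trim()) : Optional.empty();
    }

    public static void main(String[] args) {
        // Placeholder body shaped like the response above; the URL is hypothetical.
        String html = "<head><noscript><META http-equiv=\"refresh\" "
                + "content=\"0;URL=http://example.com/next\"></noscript></head>";
        System.out.println(extractTarget(html).orElse("(no meta refresh)"));
    }
}
```

The extracted URL can then be passed to another `Jsoup.connect(...)` call, repeating until a page without a meta refresh comes back.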

A request without a User-Agent returns:

  • status code 301
  • a "location" header with the redirect URL

Curl example

curl --include --raw "https://t. co/sLMy6zi4Yw"

HTTP/1.1 301 Moved Permanently
cache-control: private,max-age=300
content-length: 0
location: http://bit. ly/2n3VDpo
server: tsa_b
strict-transport-security: max-age=0
vary: Origin
x-response-time: 9
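The 301 case can also be followed by hand, one hop at a time, which makes it easy to count the hops. Here is a rough sketch of the per-hop decision, assuming the status code and Location header have already been read from a response. `RedirectHop` and `nextHop` are hypothetical names, and the URLs are placeholders.

```java
import java.net.URI;
import java.util.Optional;

class RedirectHop {
    /**
     * Returns the next URL to request, or empty if the response is not a redirect.
     * A relative Location value is resolved against the current URL.
     */
    static Optional<URI> nextHop(URI current, int statusCode, String location) {
        boolean isRedirect = statusCode == 301 || statusCode == 302 || statusCode == 303
                || statusCode == 307 || statusCode == 308;
        if (!isRedirect || location == null || location.isEmpty()) {
            return Optional.empty();
        }
        return Optional.of(current.resolve(location));
    }

    public static void main(String[] args) {
        URI start = URI.create("https://example.com/short"); // placeholder URL
        System.out.println(nextHop(start, 301, "http://example.org/target"));
        System.out.println(nextHop(start, 200, null));
    }
}
```

Calling this in a loop with redirects disabled on the HTTP client, and stopping when it returns empty, gives the full chain of hops.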

Upvotes: 2
