Mugoma J. Okomba

Reputation: 3295

Jsoup fails on certain sites

I am trying to parse and manipulate HTML using jsoup. It works fine for most URLs but fails on some, for example:

This works:

Document document = Jsoup.connect("https://www.yahoo.com/politics/time-to-take-sanders-seriously-1342599418519606.html")
        .userAgent("Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.95 Safari/537.11 Firefox/19.0")
        .timeout(10*1000)
        .get();

This fails:

Document document = Jsoup.connect("http://www.sciencedaily.com/releases/2016/02/160201215944.htm")
        .userAgent("Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.95 Safari/537.11 Firefox/19.0")
        .timeout(10*1000)
        .get();

Where could I be going wrong?

Thanks.

Upvotes: 0

Views: 350

Answers (2)

TDG

Reputation: 6171

The page is regular HTML. I can't explain why, but if you change your request to a POST request, you get the page you want, even though my browser (Firefox) fetches it with a GET request.
I tried adding all the other headers the browser sends (Host, Accept, etc.), but only changing the request to POST did the job.
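
A minimal sketch of that change, assuming jsoup is on the classpath and reusing the URL, user agent, and timeout from the question. The class name `PostFetch` and the helper names `buildPost` and `fetch` are made up for illustration. Whether the server still answers a POST this way is not guaranteed; some servers reject it.

```java
import java.io.IOException;

import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class PostFetch {
    // User agent copied from the question.
    static final String USER_AGENT =
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) "
            + "Chrome/23.0.1271.95 Safari/537.11 Firefox/19.0";

    // Builds the connection without sending it; the only change from the
    // question's code is that the HTTP method is set to POST.
    static Connection buildPost(String url) {
        return Jsoup.connect(url)
                .userAgent(USER_AGENT)
                .timeout(10 * 1000)
                .method(Connection.Method.POST);
    }

    // Sends the request and parses the response body into a Document.
    static Document fetch(String url) throws IOException {
        return buildPost(url).execute().parse();
    }

    public static void main(String[] args) throws IOException {
        Document document =
                fetch("http://www.sciencedaily.com/releases/2016/02/160201215944.htm");
        System.out.println(document.title());
    }
}
```

Calling `.post()` on the connection instead of `.method(...)` followed by `.execute().parse()` should be equivalent; the split here only makes it possible to inspect the request before sending it.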

Upvotes: 1

luksch

Reputation: 11712

It seems that in the second example you get back a short HTML document with little in it except some JavaScript. So fetching the page works fine. The problem is probably that the JavaScript never gets executed, which is outside the scope of jsoup, since jsoup cannot execute JavaScript.

For that you would need a different approach, e.g. Selenium WebDriver or HtmlUnit.
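
Before switching tools, it may help to confirm that the response really is a script-only page. The following is a rough heuristic sketch, assuming jsoup is on the classpath. The class name `ShellCheck`, the method `looksLikeScriptShell`, and the 200-character threshold are all hypothetical choices for illustration, not anything jsoup provides.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class ShellCheck {
    // Arbitrary cut-off (an assumption): fewer visible characters than this
    // counts as "almost no content".
    static final int MIN_TEXT_CHARS = 200;

    // Returns true when the HTML contains at least one <script> element but
    // very little visible text, which suggests the real content is built by
    // JavaScript after the page loads.
    static boolean looksLikeScriptShell(String html) {
        Document doc = Jsoup.parse(html);
        int scripts = doc.select("script").size();
        // text() collects text nodes only; script source is not included.
        int visibleChars = doc.body().text().length();
        return scripts > 0 && visibleChars < MIN_TEXT_CHARS;
    }

    public static void main(String[] args) {
        String shell = "<html><head><script>location.reload();</script></head>"
                + "<body></body></html>";
        System.out.println(looksLikeScriptShell(shell));
    }
}
```

To apply it to the failing URL, pass the raw body from `Jsoup.connect(url).execute().body()` to `looksLikeScriptShell`. If it returns true, a browser-driven tool is likely the way to go.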

Upvotes: 1
