Reputation: 3295
I am trying to parse and manipulate HTML using jsoup. It works fine for most URLs but fails on some. For example:
This works:
Document document = Jsoup.connect("https://www.yahoo.com/politics/time-to-take-sanders-seriously-1342599418519606.html")
        .userAgent("Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.95 Safari/537.11 Firefox/19.0")
        .timeout(10 * 1000)
        .get();
This fails:
Document document = Jsoup.connect("http://www.sciencedaily.com/releases/2016/02/160201215944.htm")
        .userAgent("Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.95 Safari/537.11 Firefox/19.0")
        .timeout(10 * 1000)
        .get();
Where could I be going wrong?
Thanks.
Upvotes: 0
Views: 350
Reputation: 6171
The page is regular HTML. I don't know how to explain it, but if you change your request to a POST request, you'll get what you want, even though my browser (Firefox) gets the page with a GET request.
I've tried adding all the other headers that the browser sends (Host, Accept, etc.), but only changing the request to POST did the job.
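As a rough sketch of the change described above, the request from the question could be sent as a POST like this. The class name PostFetch and the helper buildPostRequest are made up for illustration; the jsoup calls are the standard Connection API. Whether the site still answers a POST this way is an assumption that may no longer hold.

```java
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.io.IOException;

public class PostFetch {

    // Hypothetical helper: configures the request but does not send it yet.
    static Connection buildPostRequest(String url) {
        return Jsoup.connect(url)
                .userAgent("Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.95 Safari/537.11 Firefox/19.0")
                .timeout(10 * 1000)
                // Same settings as in the question, only the HTTP method differs.
                .method(Connection.Method.POST);
    }

    public static void main(String[] args) throws IOException {
        // execute() sends the request; parse() turns the response body into a Document.
        Document document = buildPostRequest("http://www.sciencedaily.com/releases/2016/02/160201215944.htm")
                .execute()
                .parse();
        System.out.println(document.title());
    }
}
```

Calling .post() instead of .get() on the original chain should be equivalent; the helper is split out here only so the request can be inspected before it is sent.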
Upvotes: 1
Reputation: 11712
It seems that in the second example you get back a short HTML document with little in it except some JavaScript. So fetching the page works fine. The problem is probably that the JavaScript never gets executed, which is outside the scope of jsoup: it parses HTML but cannot execute JavaScript.
For that you would need a different approach, e.g. Selenium WebDriver or HtmlUnit.
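One way to check whether you are in this situation is to look at what jsoup actually parsed. The sketch below is only a heuristic, and the class and method names (ScriptShellCheck, looksLikeScriptOnlyShell) are invented for illustration: it flags a document that contains script elements but no visible body text. The sample markup is a stand-in, not the real response from the site in the question.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class ScriptShellCheck {

    // Hypothetical heuristic: scripts are present, but the body has no visible text.
    static boolean looksLikeScriptOnlyShell(Document doc) {
        boolean hasScripts = !doc.select("script").isEmpty();
        // text() collects text nodes only; script contents are not included.
        boolean hasVisibleText = !doc.body().text().trim().isEmpty();
        return hasScripts && !hasVisibleText;
    }

    public static void main(String[] args) {
        // Stand-in for a page whose content is produced by JavaScript at load time.
        String shell = "<html><head></head><body>"
                + "<script>document.write('<p>Hello</p>');</script>"
                + "</body></html>";
        Document doc = Jsoup.parse(shell);

        // The script is parsed as data, never run, so the paragraph does not exist.
        System.out.println(looksLikeScriptOnlyShell(doc)); // true
        System.out.println(doc.select("p").size());        // 0
    }
}
```

If the check returns true for the document you fetched, the content you are after is most likely generated in the browser, and a tool that runs JavaScript is the more promising route.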
Upvotes: 1