Johan Lindkvist
Johan Lindkvist

Reputation: 1784

Can't get all content from webpage with HTMLParser

I am using Jsoup to parse an webpage this one https://daisy.dsv.su.se/servlet/schema.MomentInfoRuta?id=261020&kind=100&nochange=true&allPrio=true&multiple=true&allEx=true

In that webpage i can see something in the browser but when i am trying to parse it with Jsoup

Document doc = Jsoup.parse("https://daisy.dsv.su.se/servlet/schema.MomentInfoRuta?id=261020&kind=100&nochange=true&allPrio=true&multiple=true&allEx=true");
System.out.println(doc);

It will return

<html>
<head></head>
<body>
https://daisy.dsv.su.se/servlet/schema.MomentInfoRuta?id=261020&amp;kind=100&amp;nochange=true&amp;allPrio=trueμltiple=true&amp;allEx=true
</body>
</html>

Which is not all HTML.

Any suggestions how i can solve it or why it is happening?

Upvotes: 0

Views: 1285

Answers (1)

dimo414
dimo414

Reputation: 48824

That looks like they're detecting a crawler, usually via your user agent, and sending different content. Try setting your user agent string to a standard browser's string, and see if that resolves the issue you're having.

One other potential problem, though I don't think it's the issue here, is data loaded in via AJAX will not be downloaded by JSoup. It parses the HTML that gets served up, but it doesn't execute the JavaScript, so it can't get any extra content that comes in later. You might be able to resolve that issue using something like PhantomJS which can process and render HTML, CSS, and JavaScript, and would (in theory) give you the actual HTML you end up seeing in your browser.

Upvotes: 1

Related Questions