Reputation: 229
I am trying to build a very rudimentary crawler that follows certain specific links and extracts their contents. I am using jsoup to traverse the links on a page and read the required content.
However, I have hit a roadblock on one of the sites. It is a kind of news portal on which users are allowed to post their own comments, and I need to extract these comments. If there are more than 5 comments, they are spread over several pages, and the links to the subsequent pages are JavaScript calls placed in the href attribute (instead of real URLs). It is something like this:
<a id="pager1_lnkPage2" href="javascript:WebForm_DoPostBackWithOptions(new WebForm_PostBackOptions("pager1$lnkPage2", "", true, "", "", false, true))">2</a>
Now I have no idea how to traverse through the links generated by this JavaScript. Is there any way to get the data on the pages referred to by these links (on the face of it this does not seem to create any new link since the URL does not change while we navigate through other pages)?
For your reference here is a link to one such page. The links to navigate through multiple pages are at the lower right corner of the page.
This is embedded on the page with the main story in an iframe.
I have also come across the ScriptEngine interface in javax.script, but I could not understand it well enough to use it here.
Thanks
Upvotes: 3
Views: 257
Reputation: 16971
I've never used jsoup, but judging by its description (it is an HTML parser) and the fact that you are trying to bring JavaScript execution into it, I think you have chosen the wrong tool for the job.
In your case I would rather go with Zombie.js (Node.js based) or Selenium. The latter may be the better choice if you want to stick with Java, since Selenium has Java bindings.
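If you would rather avoid a full browser, it may help to know what that href does. As far as I know, it is a standard ASP.NET WebForms postback: clicking it submits the page's form by POST with the hidden field __EVENTTARGET set to the first argument (here pager1$lnkPage2), which is why the URL never changes. Below is a minimal sketch in plain Java (10 or later) that pulls the target out of the href and builds the form body. The class and method names are made up for illustration, and I am assuming the server also expects the page's other hidden inputs (such as __VIEWSTATE) to be sent back unchanged.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of jsoup or Selenium.
final class PostbackLink {

    // The first quoted argument of WebForm_PostBackOptions(...) is the event target.
    private static final Pattern TARGET =
            Pattern.compile("WebForm_PostBackOptions\\(\\s*[\"']([^\"']*)[\"']");

    /** Returns the postback target, or null if the href is not a postback link. */
    static String eventTarget(String href) {
        Matcher m = TARGET.matcher(href);
        return m.find() ? m.group(1) : null;
    }

    /**
     * Builds an application/x-www-form-urlencoded body from the page's hidden
     * fields (assumed to be collected by the caller) plus the postback target.
     */
    static String formBody(Map<String, String> hiddenFields, String target) {
        Map<String, String> fields = new LinkedHashMap<>(hiddenFields);
        fields.put("__EVENTTARGET", target);
        fields.put("__EVENTARGUMENT", "");
        StringBuilder body = new StringBuilder();
        for (Map.Entry<String, String> e : fields.entrySet()) {
            if (body.length() > 0) {
                body.append('&');
            }
            body.append(URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8))
                .append('=')
                .append(URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8));
        }
        return body.toString();
    }

    public static void main(String[] args) {
        String href = "javascript:WebForm_DoPostBackWithOptions("
                + "new WebForm_PostBackOptions(\"pager1$lnkPage2\", \"\", true, \"\", \"\", false, true))";
        Map<String, String> hidden = new LinkedHashMap<>();
        hidden.put("__VIEWSTATE", "value-copied-from-the-page"); // placeholder value
        System.out.println(formBody(hidden, eventTarget(href)));
    }
}
```

You would then POST that body to the form's action URL with whatever HTTP client you use, and parse the response the same way as the first page. I have not tried this against your site, so treat it as a starting point only.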
Upvotes: 1