Reputation: 51
I am facing two problems when using JSoup to scrape data from the web:
Its performance is not that good: it takes a bit too long to connect to a URL.
For some sites, it does not fetch the correct data from the URL. For example, try any NY Times URL, such as http://www.nytimes.com/2014/06/13/technology/facebook-to-let-users-alter-their-ad-profiles.html?ref=technology
It just loads the login page, but when I try the same URL on Google or Facebook, they fetch the data correctly. The URL also loads fine in a browser for a guest user.
Upvotes: 0
Views: 798
Reputation: 6754
What's happening here is that the NY Times is using a session cookie to determine if they should show you the content, or redirect you to the login page.
Because JSoup is dropping the cookies, you'll always retrieve the login page rather than being sent back to the content.
According to the JSoup docs, you can retrieve the cookies on the response using the cookies() method.
You can then set them on your next request with the cookies(Map<String, String>) method on Connection.
You can then manage the cookies in your request/response chain.
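As a rough sketch of that request/response chain, assuming jsoup is on the classpath: the first request is made with execute() so the Response (and its cookies) is available, and the second request sends those cookies back. The class name JsoupCookieSketch and the merge helper are made up for illustration, and whether a single round trip is enough for a given site is an assumption.

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class JsoupCookieSketch {

    // Hypothetical helper: combine cookies already collected with those from
    // a newer response; a newer value wins when a cookie name repeats.
    static Map<String, String> merge(Map<String, String> known, Map<String, String> fresh) {
        Map<String, String> merged = new HashMap<>(known);
        merged.putAll(fresh);
        return merged;
    }

    // The first request collects whatever cookies the site sets; the second
    // request sends them back so the site can recognise the session.
    static Document fetchWithCookies(String url) throws IOException {
        Connection.Response first = Jsoup.connect(url)
                .method(Connection.Method.GET)
                .execute();
        Map<String, String> cookies = merge(new HashMap<>(), first.cookies());
        return Jsoup.connect(url)
                .cookies(cookies)
                .get();
    }

    public static void main(String[] args) throws IOException {
        // args[0] is the URL of the page to fetch.
        Document doc = fetchWithCookies(args[0]);
        System.out.println(doc.title());
    }
}
```

If the site sets cookies over several hops, the same merge step can be repeated after each response before the next request is built.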
This isn't specifically a JSoup issue; you can reproduce the same thing with curl on the command line:
curl -v http://www.nytimes.com/…
Returns a "See Other" (303) response with the location of the login page.
curl -v http://www.nytimes.com/glogin?URI=http%3A%2F%2Fwww.nytimes.com%2F…
Sets a cookie and returns a "302" response that sends you back to the page.
If I request the page again, I'll start the process over, unless I send their session cookie along with my request.
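The same sequence can be sketched without JSoup, using only the JDK's HttpClient against a small local stub. Everything about the stub is an assumption made for illustration: the /article and /login paths, the session=abc cookie, and the exact status codes only imitate the behaviour described above and are not the NY Times' real endpoints.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class CookieRedirectDemo {

    // Hypothetical stand-in for the site: /article redirects to /login unless
    // the session cookie is sent; /login sets the cookie and redirects back.
    static HttpServer startStubServer() throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
        server.createContext("/article", exchange -> {
            String cookie = exchange.getRequestHeaders().getFirst("Cookie");
            if (cookie != null && cookie.contains("session=abc")) {
                byte[] body = "article text".getBytes(StandardCharsets.UTF_8);
                exchange.sendResponseHeaders(200, body.length);
                exchange.getResponseBody().write(body);
            } else {
                exchange.getResponseHeaders().add("Location", "/login");
                exchange.sendResponseHeaders(303, -1);
            }
            exchange.close();
        });
        server.createContext("/login", exchange -> {
            exchange.getResponseHeaders().add("Set-Cookie", "session=abc");
            exchange.getResponseHeaders().add("Location", "/article");
            exchange.sendResponseHeaders(302, -1);
            exchange.close();
        });
        server.start();
        return server;
    }

    // One GET without automatic redirects, optionally carrying a Cookie header.
    static HttpResponse<String> get(HttpClient client, URI uri, String cookie)
            throws IOException, InterruptedException {
        HttpRequest.Builder request = HttpRequest.newBuilder(uri).GET();
        if (cookie != null) {
            request.header("Cookie", cookie);
        }
        return client.send(request.build(), HttpResponse.BodyHandlers.ofString());
    }

    // Replays the steps by hand and records the status code of each one.
    static List<Integer> statusTrail() {
        HttpServer server = null;
        try {
            server = startStubServer();
            URI base = URI.create("http://127.0.0.1:" + server.getAddress().getPort());
            HttpClient client = HttpClient.newBuilder()
                    .version(HttpClient.Version.HTTP_1_1)
                    .followRedirects(HttpClient.Redirect.NEVER)
                    .build();
            List<Integer> trail = new ArrayList<>();

            // 1. No cookie: the stub points us at the login page.
            HttpResponse<String> first = get(client, base.resolve("/article"), null);
            trail.add(first.statusCode());
            String loginPath = first.headers().firstValue("Location").orElseThrow();

            // 2. The login page sets the session cookie and redirects back.
            HttpResponse<String> login = get(client, base.resolve(loginPath), null);
            trail.add(login.statusCode());
            String cookie = login.headers().firstValue("Set-Cookie").orElseThrow()
                    .split(";", 2)[0];

            // 3. With the cookie attached, the content is served.
            trail.add(get(client, base.resolve("/article"), cookie).statusCode());

            // 4. Without the cookie, the process starts over.
            trail.add(get(client, base.resolve("/article"), null).statusCode());
            return trail;
        } catch (IOException | InterruptedException e) {
            throw new IllegalStateException(e);
        } finally {
            if (server != null) {
                server.stop(0);
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(statusTrail());
    }
}
```

The trail is meant to mirror the curl sequence: a 303 to the login page, a 302 back with a cookie, a 200 once the cookie is sent, and a 303 again when it is left out.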
Upvotes: 1