Rajesh Mbm

Reputation: 844

Not able to parse the complete HTML of a URL using Jsoup

The Jsoup library is not parsing the complete HTML of a given URL. Some divs are missing compared with the original HTML of the page.

Interesting thing: http://facebook.com/search.php?init=s:email&[email protected]&type=users

If you enter the URL above on Jsoup's official try-it site, http://try.jsoup.org/, the fetch shows the exact HTML of the page, but I can't get the same result in my own program using the Jsoup library.

Here is my Java code:

String url="http://facebook.com/search.php?init=s:email&[email protected]&type=users";

Document document = Jsoup.connect(url).userAgent("Mozilla/5.0 (Windows NT 6.2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.69 Safari/537.36").get();

String question = document.toString();
System.out.println(" whole content: "+question);

I set the same userAgent that their official site uses, but the result contains only about 70% of the original HTML. Somewhere in the middle, a few div tags are missing, and those are the ones holding the data I need.

I have tried repeatedly with no success. Why are a few div tags missing from the document?

If you open the URL directly in your browser while logged into Facebook, you see the response: "No results found for your query. Check your spelling or try another term." This is what I am looking for when Jsoup parses the HTML of the URL above.

Unfortunately, this part is missing. The response is in the div with id "pagelet_search_no_results", and I could not find a div with this id in the parsed HTML. I tried many of the methods Jsoup offers, but no luck.

Upvotes: 2

Views: 4545

Answers (2)

Régis

Reputation: 8939

You should also set a large timeout, for example:

Document document = Jsoup.connect(url)
  .header("Accept-Encoding", "gzip, deflate")
  .userAgent("Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0")
  .maxBodySize(0)
  .timeout(600000)
  .get();
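
To check whether these settings helped, it may be easier to look at the raw response body than at document.toString(), because the parser closes any tags left open and so hides a cut-off page. Below is a rough sketch using only the standard library. The helpers `containsId` and `looksComplete` are hypothetical names, not part of Jsoup, and the raw string is assumed to come from something like `Jsoup.connect(url).execute().body()`.

```java
import java.util.Locale;

public class TruncationCheck {

    // Hypothetical helper: true if the raw HTML text contains an id attribute
    // with the given value (double- or single-quoted). Plain text search only.
    public static boolean containsId(String rawHtml, String id) {
        return rawHtml.contains("id=\"" + id + "\"")
            || rawHtml.contains("id='" + id + "'");
    }

    // Heuristic only: a complete page normally ends with a closing </html> tag.
    public static boolean looksComplete(String rawHtml) {
        return rawHtml.trim().toLowerCase(Locale.ROOT).endsWith("</html>");
    }

    public static void main(String[] args) {
        // Stand-in for a body that was cut off before the interesting div.
        String truncated = "<html><body><div id=\"header\">top</div><div";
        // Stand-in for the full body.
        String full = "<html><body><div id=\"header\">top</div>"
            + "<div id=\"pagelet_search_no_results\">No results</div></body></html>";

        System.out.println(looksComplete(truncated));                         // false
        System.out.println(containsId(truncated, "pagelet_search_no_results")); // false
        System.out.println(looksComplete(full));                              // true
        System.out.println(containsId(full, "pagelet_search_no_results"));      // true
    }
}
```

If the raw body does not end with a closing tag, truncation is a likely cause. If it looks complete but the id is still absent, the server probably sent different content to the program than to the browser.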

Upvotes: 3

luksch

Reputation: 11712

As far as I know, Jsoup usually restricts the size of the retrieved content to 1 MB. Try this to get the full HTML source:

Document document = Jsoup.connect(url)
  .userAgent("Mozilla/5.0 (Windows NT 6.2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.69 Safari/537.36")
  .maxBodySize(0)
  .get();

Calling maxBodySize(0) removes the 1 MB limit. There are other useful parameters you can set on the connection, like a timeout or cookies.
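
To illustrate why a size cap can make divs near the end of a page disappear, here is a small standard-library sketch of a capped read where 0 means "no limit". This is not Jsoup's actual implementation. `readCapped` and `cappedText` are hypothetical helpers, and the HTML string is made up for the demo.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class BodyCapSketch {

    // Hypothetical helper: reads at most maxBytes from the stream.
    // A maxBytes of 0 is treated as "no limit".
    public static byte[] readCapped(InputStream in, int maxBytes) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        boolean capped = maxBytes > 0;
        int remaining = maxBytes;
        while (true) {
            int want = capped ? Math.min(buffer.length, remaining) : buffer.length;
            if (want == 0) {
                break; // cap reached, the rest of the body is dropped
            }
            int n = in.read(buffer, 0, want);
            if (n == -1) {
                break; // end of stream
            }
            out.write(buffer, 0, n);
            if (capped) {
                remaining -= n;
            }
        }
        return out.toByteArray();
    }

    // Convenience wrapper for in-memory ASCII text, used by the demo below.
    public static String cappedText(String text, int maxBytes) {
        byte[] raw = text.getBytes(StandardCharsets.US_ASCII);
        try {
            return new String(readCapped(new ByteArrayInputStream(raw), maxBytes),
                              StandardCharsets.US_ASCII);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String html = "<html><body><div id=\"a\">start</div>"
                    + "<div id=\"late\">end</div></body></html>";

        // With a 40-byte cap the second div is cut off before its id attribute.
        System.out.println(cappedText(html, 40).contains("id=\"late\"")); // false
        // With 0 (no limit) the whole body is read.
        System.out.println(cappedText(html, 0).contains("id=\"late\""));  // true
    }
}
```

Anything past the cap never reaches the parser, so an element located late in a large page will not be found no matter which selector is used.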

Upvotes: 4
