faghani

Reputation: 589

Retrieving the text of an element in jsoup

While using jsoup to parse HTML pages such as "google.com", I ran into a problem retrieving the text of an element.

For example, when I call the text function on this div element, the words "Programs" and "Business" are joined together, which I don't think is right:

<div id="fll" style="margin:19px auto;text-align:center">
   <a href="/intl/en/ads/">Advertising&nbsp;Programs</a>
   <a href="/services/">Business Solutions</a>
   <a href="https://plus.google.com/" rel="publisher">+Google</a>
   <a href="/intl/en/about.html">About Google</a>
</div>

You can test my claim with this code:

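import java.net.URL;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
URL url = new URL("http://www.google.com");
Document document = Jsoup.parse(url, 10000);
Element element = document.select("div[id=fll]").first();
System.out.println(element.text());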

Output will be:

Advertising ProgramsBusiness Solutions+GoogleAbout Google

Can anything be done about this?

By the way, I traced the code and found that the problem goes away if I add this line:

textNode.text(textNode.text() + " ");

between lines 755 and 756 of the Element class in the nodes package of the jsoup source code.

The same problem also exists in the Elements class in the select package, and probably in other text functions.

Upvotes: 1

Views: 1659

Answers (1)

B. Anderson

Reputation: 3179

The text() method in jsoup returns only the text contained in an element. In your example the element is a div, so calling text() on it effectively strips all of the tags and leaves just the text. Since there is no space after "Programs", it ends up directly next to "Business", which in this case is the correct behavior.
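To illustrate, here is a minimal sketch of that behavior. It assumes jsoup is on the classpath; the class name TextDemo and the helper textOf are made up for this example and are not part of jsoup.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

class TextDemo {
    // Hypothetical helper: parses an HTML fragment and returns the text of its first div.
    static String textOf(String html) {
        Element div = Jsoup.parse(html).select("div").first();
        return div.text();
    }

    public static void main(String[] args) {
        // No whitespace between the inline <a> elements, so their text runs together.
        System.out.println(textOf("<div><a>One</a><a>Two</a></div>"));   // OneTwo
        // Whitespace between the elements is kept (normalized to a single space).
        System.out.println(textOf("<div><a>One</a> <a>Two</a></div>"));  // One Two
    }
}
```

So whether the words are separated depends on whether the markup itself contains whitespace between the links.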

If you want the text of each link separately, you can do something like this (untested code), where div is the div element you selected:

for (Element a : div.select("a")) {
    System.out.println(a.text());
}
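If you would rather get a single string with the link texts separated by spaces, one possible approach is to collect each link's text and join the pieces yourself. This is only a sketch: it assumes jsoup is on the classpath, and the class LinkTextJoiner and the method joinLinkTexts are hypothetical names, not jsoup API.

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class LinkTextJoiner {
    // Hypothetical helper: collects the text of every <a> inside the given
    // element and joins the pieces with a single space.
    static String joinLinkTexts(Element container) {
        List<String> texts = new ArrayList<>();
        for (Element a : container.select("a")) {
            texts.add(a.text());
        }
        return String.join(" ", texts);
    }

    public static void main(String[] args) {
        // Markup modeled on the question, with no whitespace between the links.
        String html = "<div id=\"fll\">"
                + "<a href=\"/intl/en/ads/\">Advertising&nbsp;Programs</a>"
                + "<a href=\"/services/\">Business Solutions</a>"
                + "<a href=\"https://plus.google.com/\" rel=\"publisher\">+Google</a>"
                + "<a href=\"/intl/en/about.html\">About Google</a>"
                + "</div>";
        Document document = Jsoup.parse(html);
        Element div = document.getElementById("fll");
        System.out.println(joinLinkTexts(div));
    }
}
```

This leaves the jsoup source untouched and puts the separator under your control, so you could just as easily join with a newline or a comma.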

Upvotes: 3
