Reputation: 589
While using jsoup to parse some HTML pages such as "google.com", I ran into a problem retrieving the text of an element.
For example, when I call the text function on this div element, the words "Programs" and "Business" are joined together, which I don't think is right:
<div id="fll" style="margin:19px auto;text-align:center">
<a href="/intl/en/ads/">Advertising Programs</a>
<a href="/services/">Business Solutions</a>
<a href="https://plus.google.com/" rel="publisher">+Google</a>
<a href="/intl/en/about.html">About Google</a>
</div>
You can test my claim with this code:
URL url = new URL("http://www.google.com");
Document document = Jsoup.parse(url, 10000);
Element element = document.select("div[id=fll]").first();
System.out.println(element.text());
Output will be:
Advertising ProgramsBusiness Solutions+GoogleAbout Google
Can anything be done about this?
By the way, I traced the code and found that the problem goes away if I add this line:
textNode.text(textNode.text() + " ");
between lines 755 and 756 of the Element class in the nodes package of the jsoup source code.
The same problem also exists in the Elements class of the select package, and probably in other text functions.
Upvotes: 1
Views: 1659
Reputation: 3179
The text() method in jsoup returns only the text in an element. In your example, the element is a div. When you call text() on it, all of the tags are essentially removed and only the text remains. Since there is no space after "Programs", it runs straight into "Business", which in this case is correct behavior.
If you want the text separately, you can do something like this (untested code):
// div is the Element selected with document.select("div[id=fll]").first()
for (Element a : div.select("a")) {
    System.out.println(a.text());
}
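To get everything on one line with spaces in between, the same loop can collect the link texts and join them. This is only a sketch: the class name LinkTextDemo and the helper joinLinkTexts are made up for illustration, and it parses an inline HTML string (a shortened copy of the div from the question) instead of fetching google.com.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

public class LinkTextDemo {

    // Hypothetical helper (not part of jsoup): collects the text of each <a>
    // inside div#fll and joins the pieces with a single space.
    static String joinLinkTexts(String html) {
        Document document = Jsoup.parse(html);
        Element div = document.select("div[id=fll]").first();
        if (div == null) {
            return ""; // no matching div in this document
        }
        List<String> parts = new ArrayList<>();
        for (Element a : div.select("a")) {
            parts.add(a.text());
        }
        return String.join(" ", parts);
    }

    public static void main(String[] args) {
        // Shortened version of the markup from the question, with no
        // whitespace between the anchors.
        String html = "<div id=\"fll\">"
                + "<a href=\"/intl/en/ads/\">Advertising Programs</a>"
                + "<a href=\"/services/\">Business Solutions</a>"
                + "</div>";
        System.out.println(joinLinkTexts(html));
    }
}
```

This way the separator is chosen by the caller, and the text nodes in the parsed document are left untouched.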
Upvotes: 3