Reputation: 1267
I was under the impression that the most costly method in Jsoup's API is parse().
But I just discovered that Document.html() could be even slower.
Given that the Document is the output of parse() (i.e. this is after parsing), I find this surprising.
Why is Document.html() so slow?
Upvotes: 4
Views: 474
Reputation: 1267
Answering myself. The Element.html() method is implemented as:
public String html() {
    StringBuilder accum = new StringBuilder();
    html(accum);
    return accum.toString().trim();
}
Using StringBuilder instead of String is already a good thing, and the calls to StringBuilder.toString() and String.trim() probably do not explain the slowness of Document.html(), even for a relatively large document.
But in the middle, the method calls an overloaded version, Element.html(StringBuilder), which loops through the element's child nodes, and each outerHtml() call in turn serializes that node's whole subtree:
private void html(StringBuilder accum) {
    for (Node node : childNodes)
        node.outerHtml(accum);
}
Thus every call to html() walks the tree again and rebuilds the string from scratch, so if the document contains lots of nodes, it will be slow.
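To make the cost concrete, here is a minimal sketch of the same loop-over-children shape. SketchNode and its visits counter are hypothetical stand-ins, not Jsoup's actual Node class. It only illustrates that serializing from the root touches every node in the tree.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a DOM node; this is NOT Jsoup's Node class.
class SketchNode {
    static int visits = 0; // counts how many nodes serialization touches

    final String tag;
    final List<SketchNode> childNodes = new ArrayList<>();

    SketchNode(String tag) {
        this.tag = tag;
    }

    SketchNode append(SketchNode child) {
        childNodes.add(child);
        return this;
    }

    // Serializes this node and, recursively, its whole subtree.
    void outerHtml(StringBuilder accum) {
        visits++;
        accum.append('<').append(tag).append('>');
        for (SketchNode child : childNodes)
            child.outerHtml(accum);
        accum.append("</").append(tag).append('>');
    }

    // Same shape as the html() quoted above: loop over direct children,
    // letting each child serialize its own subtree.
    String html() {
        StringBuilder accum = new StringBuilder();
        for (SketchNode child : childNodes)
            child.outerHtml(accum);
        return accum.toString().trim();
    }
}

class SketchDemo {
    public static void main(String[] args) {
        SketchNode root = new SketchNode("html");
        SketchNode body = new SketchNode("body");
        for (int i = 0; i < 1000; i++)
            body.append(new SketchNode("p"));
        root.append(body);

        SketchNode.visits = 0;
        String out = root.html();
        System.out.println(SketchNode.visits + " nodes visited, " + out.length() + " chars");
    }
}
```

Calling html() a second time repeats all of that work, because nothing is kept between calls.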
It would be interesting to see whether there could be a faster implementation of this.
For example, Jsoup could store a cached copy of the raw HTML that was provided to it via Jsoup.parse(). As an option, of course, to maintain backward compatibility and a small memory footprint.
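Until something like that exists, the caching idea can be approximated on the caller's side. The sketch below is only an assumption of how that might look: RawHtmlHolder is a hypothetical wrapper, not part of Jsoup, and it takes the parser as a Function so that it stays self-contained. With Jsoup on the classpath, Jsoup::parse should fit that parameter.

```java
import java.util.function.Function;

// Hypothetical wrapper, not a Jsoup class: keeps the original input string
// next to whatever the parser produced from it.
final class RawHtmlHolder<D> {
    private final String rawHtml;
    private final D document;

    RawHtmlHolder(String rawHtml, Function<String, D> parser) {
        this.rawHtml = rawHtml;
        this.document = parser.apply(rawHtml);
    }

    // Returns the cached input without touching the parsed tree.
    String rawHtml() {
        return rawHtml;
    }

    D document() {
        return document;
    }
}

class RawHtmlHolderDemo {
    public static void main(String[] args) {
        // String::length is a trivial stand-in parser used only so the demo
        // runs without any third-party library.
        RawHtmlHolder<Integer> holder = new RawHtmlHolder<>("<p>hi</p>", String::length);
        System.out.println(holder.rawHtml() + " -> " + holder.document());
    }
}
```

Two caveats apply to this approach: the raw input is not necessarily the same text that html() would return, since the parsed document may be normalized, and the cached string goes stale as soon as the document is modified.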
Upvotes: 7