Souper

Reputation: 1267

Why is Document.html() so slow?

I was under the impression that the most costly method in Jsoup's API is parse().

But I just discovered that Document.html() can be even slower.

Given that the Document is the output of parse() (i.e. this is after parsing), I find this surprising.

Why is Document.html() so slow?

Upvotes: 4

Views: 474

Answers (1)

Souper

Reputation: 1267

Answering my own question. The Element.html() method is implemented as:

public String html() {
  StringBuilder accum = new StringBuilder();
  html(accum); 
  return accum.toString().trim();
}

Using a StringBuilder rather than String concatenation is already the right choice, and the final StringBuilder.toString() and String.trim() calls are unlikely to explain the slowness of Document.html(), even for a relatively large document.

But in between, the method calls an overload, Element.html(StringBuilder), which loops over the element's child nodes and asks each one to serialize itself:

private void html(StringBuilder accum) {
  for (Node node : childNodes)
    node.outerHtml(accum);
}

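Each node.outerHtml(accum) call in turn serializes that node's own subtree, so a single Document.html() call walks the entire tree and rebuilds the markup from scratch. Thus if the document contains lots of nodes, it will be slow.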
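To make the cost visible, here is a minimal sketch of the same traversal pattern on a toy tree. ToyNode and the visited counter are hypothetical stand-ins I made up for illustration, not jsoup classes; the only point is that the number of nodes visited (and the amount of text appended) grows with the size of the tree on every call.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the traversal; ToyNode is a hypothetical stand-in, not a jsoup class.
class SerializationSketch {
    static class ToyNode {
        final String tag;
        final List<ToyNode> children = new ArrayList<>();

        ToyNode(String tag) { this.tag = tag; }

        ToyNode add(ToyNode child) { children.add(child); return this; }

        // Same shape as outerHtml(StringBuilder): write this node, then recurse into children.
        void outerHtml(StringBuilder accum, int[] visited) {
            visited[0]++; // count every node touched during serialization
            accum.append('<').append(tag).append('>');
            for (ToyNode child : children) child.outerHtml(accum, visited);
            accum.append("</").append(tag).append('>');
        }

        // Same shape as html(): serialize the children only, then trim.
        String html(int[] visited) {
            StringBuilder accum = new StringBuilder();
            for (ToyNode child : children) child.outerHtml(accum, visited);
            return accum.toString().trim();
        }
    }

    // Builds <html><head/><body><p/><p/></body></html>.
    static ToyNode sample() {
        return new ToyNode("html")
            .add(new ToyNode("head"))
            .add(new ToyNode("body").add(new ToyNode("p")).add(new ToyNode("p")));
    }

    public static void main(String[] args) {
        int[] visited = {0};
        System.out.println(sample().html(visited));
        System.out.println("nodes visited: " + visited[0]);
    }
}
```

Nothing is remembered between calls, so calling html() twice on an unchanged tree does the full walk twice.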

It would be interesting to see whether there could be a faster implementation of this.

For example, Jsoup could store a cached copy of the raw HTML that was passed to Jsoup.parse(). This would have to be optional, of course, to maintain backward compatibility and keep the memory footprint small.
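Until something like that exists in the library, the idea can be approximated on the caller's side. The sketch below is only an assumption of how such a cache might look: CachedHtml, markModified() and the serializer parameter are names I invented, not part of jsoup. It returns the stored raw string as long as nothing has been flagged as modified, and falls back to the expensive serializer otherwise.

```java
import java.util.function.Supplier;

// Hypothetical caller-side cache; none of these names exist in jsoup.
class CachedHtmlSketch {
    static class CachedHtml {
        private final String rawHtml;              // the string originally handed to the parser
        private final Supplier<String> serializer; // the expensive path, e.g. a tree serialization
        private boolean modified = false;

        CachedHtml(String rawHtml, Supplier<String> serializer) {
            this.rawHtml = rawHtml;
            this.serializer = serializer;
        }

        // The caller must invoke this after changing the tree; nothing detects changes automatically.
        void markModified() { modified = true; }

        // Cheap while unmodified, full serialization afterwards.
        String html() { return modified ? serializer.get() : rawHtml; }
    }

    public static void main(String[] args) {
        CachedHtml cached = new CachedHtml("<p>hi</p>", () -> "<p>re-serialized</p>");
        System.out.println(cached.html()); // raw copy, serializer not called
        cached.markModified();
        System.out.println(cached.html()); // serializer result
    }
}
```

With jsoup on the classpath, the serializer could be the method reference doc::html for a parsed Document doc. One caveat: the raw input is not necessarily identical to what Document.html() would produce, because the parser normalizes the markup, so this only helps when the original text is what you actually want back.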

Upvotes: 7
