Why is Document.html() so slow?

Question

I was under the impression that the most costly method in Jsoup's API is parse().

But I just discovered that Document.html() could be even slower.

Given that the Document is the output of parse() (i.e. this is after parsing), I find this surprising.

Why is Document.html() so slow?

Souper · Accepted Answer

Answering myself. The Element.html() method is implemented as:

public String html() {
  StringBuilder accum = new StringBuilder();
  html(accum); 
  return accum.toString().trim();
}

Using StringBuilder instead of String concatenation is already a good thing, and the calls to StringBuilder.toString() and String.trim() are cheap compared with building the string, so they are unlikely to explain the slowness of Document.html(), even for a relatively large document.

But in the middle, the method calls an overloaded version, Element.html(StringBuilder), which loops through the element's child nodes and serializes each one:

private void html(StringBuilder accum) {
  for (Node node : childNodes)
    node.outerHtml(accum);
}

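Thus if the document contains lots of nodes, it will be slow: each outerHtml() call also serializes that node's descendants, so every call to Document.html() walks the whole tree and rebuilds the entire string from scratch.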
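To make the cost concrete, here is a minimal sketch of the same pattern on a toy tree. ToyNode and its visits counter are hypothetical names made up for illustration; they are not Jsoup classes, and the real Jsoup traversal may be implemented differently. The only point is that serializing from the root touches every node below it.

```java
import java.util.ArrayList;
import java.util.List;

// Toy stand-in for a DOM node. Hypothetical; not part of Jsoup.
final class ToyNode {
    // Counts how many nodes serialization touches (for demonstration only).
    static int visits = 0;

    final String tag;
    final List<ToyNode> children = new ArrayList<>();

    ToyNode(String tag) {
        this.tag = tag;
    }

    // Appends a child and returns this node so calls can be chained.
    ToyNode add(ToyNode child) {
        children.add(child);
        return this;
    }

    // Same shape as Element.html(): build into a StringBuilder, then trim.
    String html() {
        StringBuilder accum = new StringBuilder();
        html(accum);
        return accum.toString().trim();
    }

    // Same shape as Element.html(StringBuilder): loop over direct children.
    private void html(StringBuilder accum) {
        for (ToyNode child : children) {
            child.outerHtml(accum);
        }
    }

    // Serializes this node and, recursively, everything below it.
    void outerHtml(StringBuilder accum) {
        visits++;
        accum.append('<').append(tag).append('>');
        html(accum);
        accum.append("</").append(tag).append('>');
    }

    public static void main(String[] args) {
        ToyNode body = new ToyNode("body");
        for (int i = 0; i < 1000; i++) {
            body.add(new ToyNode("p"));
        }
        ToyNode root = new ToyNode("root").add(new ToyNode("head")).add(body);

        ToyNode.visits = 0;
        String out = root.html();
        System.out.println("characters: " + out.length() + ", nodes visited: " + ToyNode.visits);
    }
}
```

In this model the work grows linearly with the number of nodes, and nothing is remembered between calls, so calling html() twice does the full walk twice.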

It would be interesting to see whether there could be a faster implementation of this.

For example, Jsoup could store a cached copy of the raw HTML that was provided to it via Jsoup.parse(). This would have to be optional, of course, to maintain backward compatibility and keep the memory footprint small.
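Until something like that exists in the library, the same idea can be approximated in user code. The following is only a sketch under assumptions: CachedHtml and markModified() are hypothetical names, nothing here is part of Jsoup, and the caller is responsible for flagging changes to the document. Note also that the raw input is not necessarily identical to what Document.html() would return, because Jsoup normalizes the document while parsing and may pretty-print on output.

```java
import java.util.function.Supplier;

// Hypothetical helper, not part of Jsoup: keeps the original input text
// and only falls back to (slow) re-serialization after a modification.
final class CachedHtml {
    private final String rawHtml;              // exact text handed to the parser
    private final Supplier<String> serializer; // slow path, e.g. document::html
    private boolean modified = false;

    CachedHtml(String rawHtml, Supplier<String> serializer) {
        this.rawHtml = rawHtml;
        this.serializer = serializer;
    }

    // The caller must invoke this after changing the parsed document.
    void markModified() {
        modified = true;
    }

    // Returns the cached input while untouched, otherwise re-serializes.
    String html() {
        return modified ? serializer.get() : rawHtml;
    }
}
```

With Jsoup on the classpath, this could be wired up as new CachedHtml(raw, doc::html), where doc is the result of Jsoup.parse(raw). The trade-off is the one mentioned above: the raw string stays in memory for as long as the wrapper does.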
