Ivan

Reputation: 95

Replace text in all text nodes in a tree using Jsoup

For example, suppose we need to uppercase all text inside certain HTML tags. I can do it like this:

import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.TextNode;
import org.jsoup.select.Elements;

String htmlText = "<h1>Apollo 11</h1><p><strong>Apollo 11</strong> "
        + "was the spaceflight that landed the first humans, Americans <strong>"
        + "<a href=\"http://en.wikipedia.org/wiki/Neil_Armstrong\">Neil Armstrong</a></strong> and... </p>";
Document document = Jsoup.parse(htmlText);
Elements textElements = document.select("h1, p");

for (Element element : textElements) {
    // textNodes() returns only the direct child text nodes of this element
    List<TextNode> textNodes = element.textNodes();
    for (TextNode textNode : textNodes) {
        textNode.text(textNode.text().toUpperCase());
    }
}
System.out.println(document.html());

Result: <html><head></head><body><h1>APOLLO 11</h1><p><strong>Apollo 11</strong> WAS THE SPACEFLIGHT THAT LANDED THE FIRST HUMANS, AMERICANS <strong><a href="http://en.wikipedia.org/wiki/Neil_Armstrong">Neil Armstrong</a></strong> AND... </p></body></html>

So the text inside child elements (<strong>Apollo 11</strong>) was not uppercased.

I can loop through each element's child nodes and handle text nodes and child elements separately, like this:

for (Node node : element.childNodes()) {
    if (node instanceof TextNode) {
        String nodeText = ((TextNode) node).text();
        nodeText = nodeText.toUpperCase();
        ((TextNode) node).text(nodeText);
    } else if (node instanceof Element) {
        // Guard the cast: child nodes can also be comments or data nodes
        String nodeText = ((Element) node).text();
        nodeText = nodeText.toUpperCase();
        ((Element) node).text(nodeText);
    }
}

But setting the text with ((Element) node).text(nodeText) replaces the element's children with a single text node, so all child tags are lost and we get: <html><head></head><body><h1>APOLLO 11</h1><p><strong>APOLLO 11</strong> WAS THE SPACEFLIGHT THAT LANDED THE FIRST HUMANS, AMERICANS <strong>NEIL ARMSTRONG</strong> AND... </p></body></html>

Notice the missing link tag around "NEIL ARMSTRONG".

We could add another inner loop that also checks for TextNode and Element, but that only covers one more level of nesting, so I don't think it is a real solution.

So my question is: how can I manipulate the text in all Elements/TextNodes of an HTML tree while keeping all child tags untouched?

Upvotes: 3

Views: 1903

Answers (1)

fabian

Reputation: 82461

Just apply the first approach to all selected elements and their descendants:

Elements textElements = document.select("h1, p");

for (Element e : textElements.select("*")) {
    for (TextNode tn : e.textNodes()) {
        tn.text(tn.text().toUpperCase());
    }
}
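
For comparison, the same result can probably be reached with a depth-first traversal instead of a second select. The following is only a sketch under some assumptions: the class name UppercaseAllText and the helper uppercaseText are made up for illustration, and the example.com link stands in for the real URL. It relies on Element.traverse(NodeVisitor), which visits the element and every descendant node, so text nodes at any depth are reached while the surrounding tags stay in place.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;
import org.jsoup.select.NodeVisitor;

class UppercaseAllText {

    // Hypothetical helper: uppercases every text node below the elements
    // matched by cssQuery and returns the resulting body HTML.
    static String uppercaseText(String html, String cssQuery) {
        Document document = Jsoup.parse(html);
        for (Element element : document.select(cssQuery)) {
            // traverse() walks the element and all of its descendants depth-first
            element.traverse(new NodeVisitor() {
                @Override
                public void head(Node node, int depth) {
                    if (node instanceof TextNode) {
                        TextNode textNode = (TextNode) node;
                        textNode.text(textNode.text().toUpperCase());
                    }
                }

                @Override
                public void tail(Node node, int depth) {
                    // nothing to do when leaving a node
                }
            });
        }
        return document.body().html();
    }

    public static void main(String[] args) {
        String html = "<h1>Apollo 11</h1><p><strong>"
                + "<a href=\"http://example.com/\">Neil Armstrong</a></strong> and...</p>";
        System.out.println(uppercaseText(html, "h1, p"));
    }
}
```

One difference to keep in mind: if the selector matches an element that is nested inside another match, this version visits the inner text nodes twice. That is harmless for toUpperCase(), but a transformation that is not idempotent would be better served by the select("*") loop above, which yields each element only once.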

Upvotes: 4
