user8788828

Reputation: 11

Java Jsoup : Extract all the text

I have the following code. The doc.body().text() statement doesn't output the text inside the style and script tags. I read the code of the .text() function, and it looks for all instances of TextNode. What is a TextNode in Jsoup?

And why is the script text not included in the .text() output?

String contex = "<html><body><style>style</style><div>div</div><script>script</script><p>paragraph</p>body</body></html>";
Document doc = Jsoup.parse(contex, "UTF-8");
String text = doc.body().text();
System.out.println("Test text : " + text);

Output : paragraphbody

Upvotes: 1

Views: 1108

Answers (2)

gurvinder372

Reputation: 68393

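And why is the script text not included in the .text() output?

Because script and style elements hold data, not text. Jsoup stores their content as DataNode objects rather than TextNode objects, and .text() only collects TextNode content.

To read a script's data, first select the elements with getElementsByTag: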

Elements scriptElements = doc.getElementsByTag("script");

then read the content of each DataNode with getWholeData:

for (Element element : scriptElements) {
    for (DataNode node : element.dataNodes()) {
        System.out.println(node.getWholeData());
    }
    System.out.println("-------------------");
}

As the source code shows, characters inside a style or script tag are inserted as a DataNode instead of a TextNode:

void insert(Token.Character characterToken) {
    Node node;
    // characters in script and style go in as datanodes, not text nodes
    final String tagName = currentElement().tagName();
    final String data = characterToken.getData();

    if (characterToken.isCData())
        node = new CDataNode(data);
    else if (tagName.equals("script") || tagName.equals("style"))
        node = new DataNode(data);
    else
        node = new TextNode(data);
    // doesn't use insertNode, because we don't foster these; and will always have a stack.
    currentElement().appendChild(node);
}
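
To make the effect of that branch concrete, here is a small stand-alone sketch. It does not use jsoup at all: the classes NodeKindsSketch, TextNode, DataNode and Element below are hypothetical toy versions written only for illustration, and joining the pieces with a single space is a simplification, so the exact spacing is not meant to match what jsoup prints. It only shows the idea that a text() traversal which collects TextNode content will skip anything stored as a DataNode.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of the TextNode / DataNode split; NOT jsoup's real classes. */
public class NodeKindsSketch {

    /** Base type for anything that can sit in the tree. */
    static abstract class Node {}

    /** Character content meant for display, e.g. inside <p> or <div>. */
    static final class TextNode extends Node {
        final String text;
        TextNode(String text) { this.text = text; }
    }

    /** Raw content of <script> or <style>; kept in the tree, but not treated as text. */
    static final class DataNode extends Node {
        final String data;
        DataNode(String data) { this.data = data; }
    }

    static final class Element extends Node {
        final String tagName;
        final List<Node> children = new ArrayList<>();
        Element(String tagName) { this.tagName = tagName; }

        /** Mirrors the branch in insert(): the parent tag decides the node type. */
        Element appendCharacters(String chars) {
            boolean raw = tagName.equals("script") || tagName.equals("style");
            children.add(raw ? new DataNode(chars) : new TextNode(chars));
            return this;
        }

        Element appendChild(Element child) {
            children.add(child);
            return this;
        }

        /** Collects TextNode content only, so script/style content is skipped. */
        String text() { return collect(this, false); }

        /** Collects TextNode and DataNode content in document order. */
        String textAndData() { return collect(this, true); }
    }

    private static String collect(Node node, boolean includeData) {
        List<String> parts = new ArrayList<>();
        walk(node, includeData, parts);
        return String.join(" ", parts); // toy separator, an assumption of this sketch
    }

    /** Depth-first walk that visits children in insertion order. */
    private static void walk(Node node, boolean includeData, List<String> out) {
        if (node instanceof TextNode) {
            out.add(((TextNode) node).text);
        } else if (node instanceof DataNode) {
            if (includeData) out.add(((DataNode) node).data);
        } else if (node instanceof Element) {
            for (Node child : ((Element) node).children) walk(child, includeData, out);
        }
    }

    /** Builds the same tree shape as the body of the question's HTML. */
    static Element sampleBody() {
        return new Element("body")
            .appendChild(new Element("style").appendCharacters("style"))
            .appendChild(new Element("div").appendCharacters("div"))
            .appendChild(new Element("script").appendCharacters("script"))
            .appendChild(new Element("p").appendCharacters("paragraph"))
            .appendCharacters("body");
    }

    public static void main(String[] args) {
        Element body = sampleBody();
        System.out.println(body.text());        // div paragraph body
        System.out.println(body.textAndData()); // style div script paragraph body
    }
}
```

In this toy model, text() never sees the words "style" and "script" because they were stored as DataNode instances when they were inserted. With jsoup itself, the equivalent of textAndData() is to read the data nodes separately, as in the loop over dataNodes() above.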

Upvotes: 1

Shubhendu Pramanik

Reputation: 2751

For this, select the <script> elements with getElementsByTag, which returns an org.jsoup.select.Elements, and then read the data nodes of each element.

String contex = "<html><body><style>style</style><div>div</div><script>scripts</script><p>paragraph</p><p>body</p><script>787878</script></body></html>";
// Note: the second argument of Jsoup.parse(String, String) is the base URI, not a charset.
Document doc = Jsoup.parse(contex, "UTF-8");
Elements scriptElements = doc.getElementsByTag("script");

// Script content is stored in DataNodes, so read it with getWholeData().
for (Element el : scriptElements) {
    for (DataNode dn : el.dataNodes()) {
        System.out.println(dn.getWholeData());
    }
}

Output:

scripts
787878

Upvotes: 1
