Reputation: 11
I have the following code. The doc.body().text() statement doesn't output the text inside the style and script tags. I read the source of the .text() function, and it collects all instances of TextNode. What is a TextNode in Jsoup?
And why is the script text not included in the .text() output?
String contex = "<html><body><style>style</style><div>div</div><script>script</script><p>paragraph</p>body</body></html>";
Document doc = Jsoup.parse(contex, "UTF-8");
String text = doc.body().text();
System.out.println("Test text : " + text);
Output : paragraphbody
Upvotes: 1
Views: 1108
Reputation: 68393
And why is the script text not included in the .text() output.
Because script and style elements hold data, not text.
To get the data of a script element, first select the elements with getElementsByTag:
Elements scriptElements = doc.getElementsByTag("script");
and then read each data node with getWholeData:
for (Element element : scriptElements) {
    for (DataNode node : element.dataNodes()) {
        System.out.println(node.getWholeData());
    }
    System.out.println("-------------------");
}
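As a shorter variant of the loop above, here is a minimal sketch that uses Element.data(), which returns the combined content of an element's data nodes. The class name ScriptDataExample and the helper scriptData are made up for illustration; the jsoup library is assumed to be on the classpath.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ScriptDataExample {
    // Hypothetical helper: returns the raw contents of every <script>
    // element in the given HTML, joined by newlines.
    static String scriptData(String html) {
        Document doc = Jsoup.parse(html);
        StringBuilder out = new StringBuilder();
        for (Element script : doc.getElementsByTag("script")) {
            if (out.length() > 0) {
                out.append('\n');
            }
            // data() concatenates the element's DataNodes; text() would skip them
            out.append(script.data());
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String html = "<html><body><style>style</style><div>div</div>"
                + "<script>script</script><p>paragraph</p>body</body></html>";
        System.out.println("script data : " + scriptData(html));
        System.out.println("body text   : " + Jsoup.parse(html).body().text());
    }
}
```

The same call works for style elements if you pass "style" to getElementsByTag.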
As per the jsoup source code, the character content of a style or script tag is inserted as a DataNode instead of a TextNode:
void insert(Token.Character characterToken) {
    Node node;
    // characters in script and style go in as datanodes, not text nodes
    final String tagName = currentElement().tagName();
    final String data = characterToken.getData();
    if (characterToken.isCData())
        node = new CDataNode(data);
    else if (tagName.equals("script") || tagName.equals("style"))
        node = new DataNode(data);
    else
        node = new TextNode(data);
    currentElement().appendChild(node); // doesn't use insertNode, because we don't foster these; and will always have a stack.
}
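To see why that branch makes .text() skip script and style content, here is a simplified model in plain Java. It is only a sketch: the classes Node, TextNode and DataNode below are hypothetical stand-ins, not jsoup's own classes, and the text helper only imitates a traversal that collects TextNode content.

```java
import java.util.ArrayList;
import java.util.List;

public class NodeModel {
    // Hypothetical stand-ins for jsoup's node types; not the library's classes.
    static class Node {
        final String content;
        Node(String content) { this.content = content; }
    }
    static class TextNode extends Node {
        TextNode(String content) { super(content); }
    }
    static class DataNode extends Node {
        DataNode(String content) { super(content); }
    }

    // Mirrors the branch in insert(): script and style content becomes data.
    static Node nodeFor(String tagName, String content) {
        if (tagName.equals("script") || tagName.equals("style")) {
            return new DataNode(content);
        }
        return new TextNode(content);
    }

    // Collects only TextNode content, so DataNodes never reach the result.
    static String text(List<Node> nodes) {
        StringBuilder sb = new StringBuilder();
        for (Node n : nodes) {
            if (n instanceof TextNode) {
                if (sb.length() > 0) {
                    sb.append(' ');
                }
                sb.append(n.content);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Node> nodes = new ArrayList<>();
        nodes.add(nodeFor("style", "style"));
        nodes.add(nodeFor("div", "div"));
        nodes.add(nodeFor("script", "script"));
        nodes.add(nodeFor("p", "paragraph"));
        System.out.println(text(nodes)); // prints "div paragraph"
    }
}
```

In this model the style and script strings are still stored, they are just held in a node type that the text collection ignores, which is the same reason dataNodes() can still return them.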
Upvotes: 1
Reputation: 2751
For this you need to select the <script> tags into an org.jsoup.select.Elements collection and read their data nodes.
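String contex = "<html><body><style>style</style><div>div</div><script>scripts</script><p>paragraph</p><p>body</p><script>787878</script></body></html>";
// Note: the second argument of Jsoup.parse(String, String) is the base URI, not a charset.
Document doc = Jsoup.parse(contex, "UTF-8");
// Select every <script> element in the document.
Elements scriptElements = doc.getElementsByTag("script");
for (Element el : scriptElements) {
    // Script content is stored in DataNodes, so read it from there.
    for (DataNode dn : el.dataNodes()) {
        System.out.println(dn.getWholeData());
    }
}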
Output:
scripts
787878
Upvotes: 1