Reputation: 12101
I'm using Jsoup to parse an HTML file and pull all the visible text from its elements. The problem is that some HTML fragments sit inside JavaScript variables, and Jsoup ignores them, as expected. What would be the best way to get those fragments out?
Example:
<!DOCTYPE html>
<html>
<head>
<script>
var html = "<span>some text</span>";
</script>
</head>
<body>
<p>text</p>
</body>
</html>
In this example Jsoup only picks up the text from the p tag, which is what it's supposed to do. How do I pick up the text from the span in var html? The solution must be applied to thousands of different pages, so I can't rely on something like the JavaScript variable having the same name.
Upvotes: 3
Views: 14520
Reputation: 8879
You can use Jsoup to parse all the <script> tags into DataNode objects.
DataNode
A data node, for contents of style, script tags etc, where contents should not show in text().
Elements scriptTags = doc.getElementsByTag("script");
This will give you all the Elements with the tag <script>.
You can then use the getWholeData() method to extract the contents of each node.
// Get the data contents of this node.
String getWholeData()
for (Element tag : scriptTags) {
    for (DataNode node : tag.dataNodes()) {
        System.out.println(node.getWholeData());
    }
}
Upvotes: 6
Reputation: 11
I'm not sure this is the right answer, but I've seen a similar situation before here.
According to that answer, you can probably combine Jsoup with manual parsing to get the text.
I've modified that code for your specific case:
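import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

Document doc = ...
Element script = doc.select("script").first(); // Get the first script element
// Regex for the value assigned to "html"; note that it depends on the variable name
Pattern p = Pattern.compile("(?is)html = \"(.+?)\"");
Matcher m = p.matcher(script.html()); // Use html() here and NOT text(): text() drops the script contents
while (m.find()) {
    System.out.println(m.group());  // the whole match, e.g. html = "<span>some text</span>"
    System.out.println(m.group(1)); // the value only
}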
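Since the question rules out relying on the variable name, here is a minimal sketch of a name-independent variant: scan the script source for any quoted string literal and keep the ones that contain something that looks like an HTML tag. The class name ScriptHtmlLiterals and the method extractHtmlLiterals are hypothetical names made up for this example. The sketch uses only java.util.regex, handles just a few common escapes, and does not understand JavaScript comments, regex literals, or template literals, so treat it as a heuristic rather than a real JavaScript parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: not part of Jsoup, just an illustration.
final class ScriptHtmlLiterals {

    // Matches a double-quoted or single-quoted string literal, allowing backslash escapes.
    private static final Pattern STRING_LITERAL =
            Pattern.compile("\"((?:\\\\.|[^\"\\\\])*)\"|'((?:\\\\.|[^'\\\\])*)'");

    // Rough test for "contains an HTML tag": '<', a letter, then anything up to '>'.
    private static final Pattern HTML_TAG = Pattern.compile("<[a-zA-Z][^>]*>");

    // Returns every string literal in the script source that appears to contain HTML.
    static List<String> extractHtmlLiterals(String scriptSource) {
        List<String> result = new ArrayList<>();
        Matcher m = STRING_LITERAL.matcher(scriptSource);
        while (m.find()) {
            // Group 1 is set for double-quoted literals, group 2 for single-quoted ones.
            String body = m.group(1) != null ? m.group(1) : m.group(2);
            // Simplification: only unescape \" \' \/ and \\ (no \n, \uXXXX, etc.).
            String unescaped = body.replaceAll("\\\\([\"'/\\\\])", "$1");
            if (HTML_TAG.matcher(unescaped).find()) {
                result.add(unescaped);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String script = "var html = \"<span>some text</span>\"; var name = \"plain\";";
        for (String fragment : extractHtmlLiterals(script)) {
            System.out.println(fragment);
        }
    }
}
```

The script source can come from DataNode.getWholeData() or from script.html(), and each returned fragment can then be passed to Jsoup.parse(fragment).text() to get its visible text.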
Hope it will be helpful.
Upvotes: 1