Caballero

Reputation: 12101

How to parse HTML from JavaScript variables with Jsoup in Java?

I'm using Jsoup to parse an HTML file and pull all the visible text from its elements. The problem is that some HTML fragments live inside JavaScript variables, and Jsoup ignores them. What would be the best way to get those fragments out?

Example:

<!DOCTYPE html>
<html>
<head>
    <script>
        var html = "<span>some text</span>";
    </script>
</head>
<body>
    <p>text</p>
</body>
</html>

In this example Jsoup only picks up the text from the p tag, which is what it's supposed to do. How do I pick up the text from the span inside var html? The solution must work across thousands of different pages, so I can't rely on something like the JavaScript variable always having the same name.

Upvotes: 3

Views: 14520

Answers (2)

Daniel B

Reputation: 8879

You can use Jsoup to read the contents of every <script> tag as DataNode objects.

DataNode

A data node, for contents of style, script tags etc, where contents should not show in text().

Elements scriptTags = doc.getElementsByTag("script");

This returns every <script> element in the document.

You can then call getWholeData() on each DataNode to extract its contents.

// Get the data contents of this node.
String getWholeData()

for (Element tag : scriptTags) {
    for (DataNode node : tag.dataNodes()) {
        System.out.println(node.getWholeData());
    }
}

Jsoup API - DataNode

Upvotes: 6

KK4SBB

Reputation: 11

I'm not sure this is the best answer, but I saw a similar situation before here.

Based on that answer, you can probably combine Jsoup with some manual parsing to get the text.

I modified that piece of code for your specific case:

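import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

Document doc = ...
Element script = doc.select("script").first(); // Get the first script element

// Regex for the value assigned to the variable; note it assumes the variable is named "html"
Pattern p = Pattern.compile("(?is)html = \"(.+?)\"");
// You have to use html() here and NOT text(): text() skips the script contents
Matcher m = p.matcher(script.html());

while (m.find()) {
    System.out.println(m.group());  // the whole match, e.g. html = "<span>some text</span>"
    System.out.println(m.group(1)); // the value only
}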
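
Since the question says the variable name cannot be relied on, here is a rough sketch of one way to generalize the regex idea: scan the script source for any quoted string literal and keep only those that look like markup. The class name ScriptHtmlLiterals and the method findHtmlLiterals are made up for illustration and are not part of Jsoup. It uses only java.util.regex, and it is a heuristic: it does not understand JavaScript comments, template literals, or string concatenation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not part of Jsoup): finds string literals in
// JavaScript source that appear to contain HTML markup.
class ScriptHtmlLiterals {

    // A double- or single-quoted literal on one line, allowing
    // backslash-escaped characters inside it.
    private static final Pattern STRING_LITERAL = Pattern.compile(
            "\"((?:[^\"\\\\\\r\\n]|\\\\.)*)\"|'((?:[^'\\\\\\r\\n]|\\\\.)*)'");

    // Heuristic for "looks like markup": a '<' followed by a letter, then a '>'.
    private static final Pattern TAG = Pattern.compile("<[a-zA-Z][^>]*>");

    static List<String> findHtmlLiterals(String scriptSource) {
        List<String> fragments = new ArrayList<>();
        Matcher m = STRING_LITERAL.matcher(scriptSource);
        while (m.find()) {
            // Group 1 is the double-quoted body, group 2 the single-quoted one.
            String body = m.group(1) != null ? m.group(1) : m.group(2);
            // Undo the most common escapes (\" \' \/ \\); others are left as-is.
            String unescaped = body.replaceAll("\\\\([\"'/\\\\])", "$1");
            if (TAG.matcher(unescaped).find()) {
                fragments.add(unescaped);
            }
        }
        return fragments;
    }

    public static void main(String[] args) {
        String script = "var html = \"<span>some text</span>\";";
        for (String fragment : findHtmlLiterals(script)) {
            System.out.println(fragment);
        }
    }
}
```

The input would be the script contents you already get from script.html() or from getWholeData(). Each returned fragment can then be handed back to Jsoup, for example Jsoup.parse(fragment).text(), to pull out the visible text.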

Hope it will be helpful.

Upvotes: 1
