Chris

Reputation: 18876

Jsoup Fine Grained Parse

I am trying to go through every HTML tag in a webpage's body and check whether it contains text. If it does, I would like to print that text:

    Document doc = Jsoup.connect(site).get();
    Elements e = doc.body().getAllElements();
    for (int i = 0; i < e.size(); i++) {
        if (doc.body().child(i).hasText()) {
            System.out.println(doc.body().child(i).text());
        }
    }

The above works, but not how I want it to. It seems the child() method is not fine-grained, as it clumps multiple 'div class' elements together. How can I traverse the DOM's body in a more fine-grained manner to see the text of each and every tag?

Thank you in advance.

Upvotes: 0

Views: 379

Answers (2)

Rodrigo Gauzmanf

Reputation: 2527

    Document doc = Jsoup.connect(site).get();
    doc.body().traverse(new NodeVisitor() {

        @Override
        public void head(Node node, int depth) {
            if (node instanceof TextNode) {
                TextNode tn = ((TextNode) node);
                // Skip text nodes that contain only whitespace;
                // refine this filter as needed
                if (tn.text().replaceAll("\\s*", "").length() > 0) {
                    System.out.println("Tag:" + tn.parent().nodeName()
                            + ", text:" + tn.text());
                }
            }
        }

        @Override
        public void tail(Node node, int depth) {
            // Do Nothing
        }
    });
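
A variation closer to the loop in the question, offered as a minimal sketch rather than the only way to do it: iterate over getAllElements() directly and read each element's ownText(), which covers only the text nodes that are direct children of that element. The class name OwnTextDemo, the helper ownTexts, and the inline HTML are made up for illustration; Jsoup.parse is used instead of Jsoup.connect so the sketch runs without a network connection.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class OwnTextDemo {

    // Hypothetical helper: returns "tag: text" for every element in the
    // body that has text of its own (text of descendants is not included).
    static List<String> ownTexts(String html) {
        Document doc = Jsoup.parse(html);
        List<String> out = new ArrayList<>();
        for (Element el : doc.body().getAllElements()) {
            String own = el.ownText().trim();
            if (!own.isEmpty()) {
                out.add(el.tagName() + ": " + own);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Sample markup, assumed for illustration only
        String html = "<div>Outer <p>Inner</p></div><div>Second</div>";
        for (String line : ownTexts(html)) {
            System.out.println(line);
        }
    }
}
```

With this markup, the outer div and the nested p are reported separately instead of being merged into one string.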

Upvotes: 1

vacuum

Reputation: 2273

You can use this approach

Inside traverse, you can check whether the current node is a TextNode:

if (node instanceof TextNode) {
  // Node itself has no text() method, so cast to TextNode first
  System.out.println(((TextNode) node).text());
}

If you are just trying to print out all the text, why not use text() from the Elements class?
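
For comparison, here is a small sketch of what text() gives you when called on a container element: the combined text of that element and all of its descendants, which is also why child(i).text() in the question appears to clump nested elements together. The names AllTextDemo and allText and the sample HTML are hypothetical, and the document is parsed from a string so that no network access is needed.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class AllTextDemo {

    // Hypothetical helper: returns the combined text of the body
    // and everything nested inside it as a single string.
    static String allText(String html) {
        Document doc = Jsoup.parse(html);
        return doc.body().text();
    }

    public static void main(String[] args) {
        // Sample markup, assumed for illustration only
        String html = "<div>Outer <p>Inner</p></div><div>Second</div>";
        // Prints one line containing the text of both divs and the nested p
        System.out.println(allText(html));
    }
}
```

This is the simplest option when the per-tag breakdown is not needed; the traversal approach is only required when each piece of text has to be tied to its own tag.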

Upvotes: 1
