Ghulam Haider

Reputation: 11

How to get children of class in jsoup

I want to scrape comments from a website, but I am having trouble getting the p tag inside a class with jsoup. Example HTML is below.

<html>
 <head>
  <title>My webpage</title>
 </head>
 <body>
  <div class="container">
     <div class="comment">
      <p>This is comment</p>
     </div>
  </div>
 </body> 
</html> 

Here is my Java code:

import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

// Class name assumed; the original class declaration was not shown.
public class Main {
    public static void main(String[] args) {
        try {
            Document doc = Jsoup.connect("https://homeshopping.pk/products/Amazon-Fire-Phone-%284G%2C-32GB%2C-Black%29-Price-in-Pakistan.html").get();
            System.out.println("Connect successfully");
            // Select the comment bodies by class name
            Elements element = doc.select("div.post-message");

            System.out.println(element.get(0).text());
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

Upvotes: 0

Views: 1906

Answers (3)

Arijit

Reputation: 1674

The comments section of the page you are trying to fetch is not plain HTML content. The comments are loaded into the DOM by JavaScript after the initial page load. jsoup is an HTML parser and does not execute JavaScript, so you cannot fetch the comments on this page with jsoup alone. To fetch this kind of content you need an embedded browser component. Take a look at this answer: Is there a way to embed a browser in Java?
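
As a rough illustration of why the selector finds nothing, the sketch below parses a made-up page whose comment element is only created by a script. The HTML string, the class name StaticParseDemo and the helper countMatches are assumptions for this example and are not taken from the real site.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class StaticParseDemo {
    // Hypothetical page: the comment element only exists once a browser runs the script.
    static final String HTML =
            "<html><body>"
            + "<div id=\"comments\"></div>"
            + "<script>"
            + "var d = document.createElement('div');"
            + "d.className = 'post-message';"
            + "d.textContent = 'Loaded later';"
            + "document.getElementById('comments').appendChild(d);"
            + "</script>"
            + "</body></html>";

    // Counts the elements matching a CSS query in the parsed (static) markup.
    static int countMatches(String html, String cssQuery) {
        Document doc = Jsoup.parse(html);
        return doc.select(cssQuery).size();
    }

    public static void main(String[] args) {
        // 1: the empty placeholder is present in the markup
        System.out.println(countMatches(HTML, "#comments"));
        // 0: jsoup never ran the script, so the comment element was never created
        System.out.println(countMatches(HTML, "div.post-message"));
    }
}
```

If the second count is 0 on the real page as well, the content is most likely injected after load and needs a browser-based tool instead.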

The code below is for the specific HTML string you provided.

Try this:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class Test {

    public static void main(String[] args) {
        try {
            Document doc = Jsoup.parse("<html> "
                    + "<head>  "
                    + "<title>My webpage</title> "
                    + "</head> <body>  <div class=\"container\">     "
                    + "<div class=\"comment\">      "
                    + "<p>This is comment</p>    "
                    + " </div>  </div> </body></html> ");

            // Narrow to .comment elements inside .container, then read the <p> text
            Elements element = doc.select(".container").select(".comment");
            System.out.println(element.get(0).select("p").text());
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

To connect to the URL, use:

doc = Jsoup.connect("https://homeshopping.pk/products/Amazon-Fire-Phone-%284G%2C-32GB%2C-Black%29-Price-in-Pakistan.html").timeout(60*1000).userAgent("Mozilla").get();

Upvotes: 2

Adam Rice

Reputation: 890

To extend Arijit's solution, if there are multiple <div> tags with a comment class, you could try:

// Needs java.util.Iterator, org.jsoup.nodes.Element and org.jsoup.select.Elements
Document doc = null;
try
{
    doc = Jsoup.parse("<html> " + "<head>  " + "<title>My webpage</title> "
            + "</head> <body>  <div class=\"container\">     " + "<div class=\"comment foo\">      "
            + "<p>This is comment</p>    " + " </div>  </div> </body></html> ");

    // "comment" is treated as a regex matched against the whole class attribute
    Elements comments = doc.getElementsByAttributeValueMatching("class", "comment");
    Iterator<Element> iter = comments.iterator();
    while (iter.hasNext())
    {
        Element e = iter.next();
        System.out.println(e.getElementsByTag("p").text());
    }
}
catch (Exception e)
{
    e.printStackTrace();
}

If there are other tags that share the comment class you can use e.tagName() to check that it is a <div>.
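
One possible alternative to checking e.tagName() by hand is to put the tag into the CSS selector itself. The sketch below assumes a made-up HTML string with two comment divs and one non-div element sharing the class; the class name CommentSelectorDemo and the helper commentTexts are hypothetical.

```java
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class CommentSelectorDemo {
    // Hypothetical markup: two <div> comments plus a <section> that reuses the class.
    static final String HTML =
            "<div class=\"container\">"
            + "<div class=\"comment foo\"><p>First</p></div>"
            + "<div class=\"comment\"><p>Second</p></div>"
            + "<section class=\"comment\"><p>Not in a div</p></section>"
            + "</div>";

    // Returns the text of every <p> that sits inside a <div> carrying the "comment" class.
    static List<String> commentTexts(String html) {
        Document doc = Jsoup.parse(html);
        return doc.select("div.comment p").eachText();
    }

    public static void main(String[] args) {
        // Prints each matching comment on its own line: First, then Second
        for (String text : commentTexts(HTML)) {
            System.out.println(text);
        }
    }
}
```

The selector div.comment matches on whole class names, so an element with class "comment foo" is included while the section is skipped because it is not a div.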

Upvotes: 1

Matthew Diana

Reputation: 1106

If your goal is to print out This is comment, you could try something like this:

org.jsoup.select.Elements element = doc.select("div.container").select("div.comment");
System.out.println(element.get(0).text());
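
Keep in mind that element.get(0) throws an IndexOutOfBoundsException when the selector matches nothing. A minimal sketch of a safer lookup follows, assuming the example HTML from the question; the class name FirstCommentDemo and the helper firstText are hypothetical.

```java
import java.util.Optional;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

public class FirstCommentDemo {
    // The example markup from the question, collapsed onto fewer lines.
    static final String HTML =
            "<html><head><title>My webpage</title></head><body>"
            + "<div class=\"container\"><div class=\"comment\">"
            + "<p>This is comment</p>"
            + "</div></div></body></html>";

    // Returns the text of the first match, or an empty Optional when nothing matches.
    static Optional<String> firstText(String html, String cssQuery) {
        Element first = Jsoup.parse(html).select(cssQuery).first(); // null when there is no match
        return first == null ? Optional.empty() : Optional.of(first.text());
    }

    public static void main(String[] args) {
        // Matches the example markup
        System.out.println(firstText(HTML, "div.container div.comment").orElse("(no match)"));
        // No such class in the example markup, so the fallback is printed
        System.out.println(firstText(HTML, "div.post-message").orElse("(no match)"));
    }
}
```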

Upvotes: 0
