Reputation: 11
I want to scrape comments from a website. I am having trouble getting the p tag inside an element with a given class using jsoup. Example HTML is below:
<html>
<head>
<title>My webpage</title>
</head>
<body>
<div class="container">
<div class="comment">
<p>This is comment</p>
</div>
</div>
</body>
</html>
Here is my Java code:
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Class name assumed; the original class declaration was not shown.
public class Main {
    public static void main(String args[]) {
        Document doc = null;
        try {
            doc = Jsoup.connect("https://homeshopping.pk/products/Amazon-Fire-Phone-%284G%2C-32GB%2C-Black%29-Price-in-Pakistan.html").get();
            System.out.println("Connect successfully");
            org.jsoup.select.Elements element = doc.select("div.post-message");
            System.out.println(element.get(0).text());
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
Upvotes: 0
Views: 1906
Reputation: 1674
The comments section of the page you are trying to fetch is not plain HTML content. The comments are added to the DOM by JavaScript after the initial page load. Jsoup is an HTML parser and does not execute JavaScript, so it cannot see those comments in the fetched page. To fetch this kind of content you need something that runs the page's scripts, such as an embedded browser component. Take a look at this answer: Is there a way to embed a browser in Java?
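One way to see this from code is to check whether the selector matches anything in the static HTML before calling get(0). The following is a minimal sketch, not a description of the real page: the class name StaticHtmlCheck, the helper firstTextOrNull, and the sample HTML string are all made up for illustration, and it assumes a jsoup version that provides selectFirst.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class StaticHtmlCheck {

    // Returns the text of the first element matching cssQuery,
    // or null when the selector matches nothing in the parsed HTML.
    static String firstTextOrNull(String html, String cssQuery) {
        Document doc = Jsoup.parse(html);
        Element first = doc.selectFirst(cssQuery); // null if there is no match
        return first == null ? null : first.text();
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for what a server might send: the comment
        // container is empty because a script would fill it in a real browser.
        String html = "<html><body><div id=\"comments\"></div>"
                + "<script>/* comments would be injected here at runtime */</script>"
                + "</body></html>";

        String text = firstTextOrNull(html, "div.post-message");
        if (text == null) {
            System.out.println("No div.post-message in the static HTML");
        } else {
            System.out.println(text);
        }
    }
}
```

In the question's code, doc.select("div.post-message") returns an empty Elements in this situation, so element.get(0) throws an IndexOutOfBoundsException instead of printing anything.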
The code below works for the specific HTML string you provided.
Try this:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;
public class Test {
public static void main(String[] arg)
{
Document doc = null;
try {
doc = Jsoup.parse("<html> "
+ "<head> "
+ "<title>My webpage</title> "
+ "</head> <body> <div class=\"container\"> "
+ "<div class=\"comment\"> "
+ "<p>This is comment</p> "
+ " </div> </div> </body></html> ");
Elements element = doc.select(".container").select(".comment");
System.out.println(element.get(0).select("p").text());
}
catch (Exception e)
{
e.printStackTrace();
}
}
}
To connect to the URL, use:
doc = Jsoup.connect("https://homeshopping.pk/products/Amazon-Fire-Phone-%284G%2C-32GB%2C-Black%29-Price-in-Pakistan.html").timeout(60*1000).userAgent("Mozilla").get();
Upvotes: 2
Reputation: 890
To extend Arijit's solution, if there are multiple <div> tags with a comment class, you could try:
Document doc = null;
try
{
doc = Jsoup.parse("<html> " + "<head> " + "<title>My webpage</title> "
+ "</head> <body> <div class=\"container\"> " + "<div class=\"comment foo\"> "
+ "<p>This is comment</p> " + " </div> </div> </body></html> ");
Elements comments = doc.getElementsByAttributeValueMatching("class", "comment");
Iterator<Element> iter = comments.iterator();
while(iter.hasNext())
{
Element e = iter.next();
System.out.println(e.getElementsByTag("p").text());
}
}
catch (Exception e)
{
e.printStackTrace();
}
If there are other tags that share the comment class, you can use e.tagName() to check that it is a <div>.
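As an alternative sketch, a single CSS selector can combine the tag check and the class check, so no separate tagName() test is needed. The query "div.comment p" selects p elements nested inside any div that has comment as one of its classes. Note that getElementsByAttributeValueMatching treats its second argument as a regular expression on the whole attribute value, so it can also match class names that merely contain "comment". The class name CommentParagraphs, the helper commentTexts, and the extra sample markup are illustrative assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class CommentParagraphs {

    // Collects the text of every <p> nested inside a <div> with class "comment".
    static List<String> commentTexts(String html) {
        Document doc = Jsoup.parse(html);
        List<String> texts = new ArrayList<>();
        for (Element p : doc.select("div.comment p")) {
            texts.add(p.text());
        }
        return texts;
    }

    public static void main(String[] args) {
        String html = "<div class=\"container\">"
                + "<div class=\"comment foo\"><p>This is comment</p></div>"
                + "<div class=\"comment\"><p>Another comment</p></div>"
                // Same class on a different tag: skipped by the div.comment selector.
                + "<section class=\"comment\"><p>Not inside a div</p></section>"
                + "</div>";

        for (String text : commentTexts(html)) {
            System.out.println(text);
        }
    }
}
```

Running main prints "This is comment" and "Another comment"; the paragraph inside the section element is skipped because its parent is not a div.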
Upvotes: 1
Reputation: 1106
If your goal is to print out This is comment, you could try something like this:
org.jsoup.select.Elements element = doc.select("div.container").select("div.comment");
System.out.println(element.get(0).text());
Upvotes: 0