remove <div> tag from html in JSoup

Question

I tried to scrape some content from a website using JSoup. This is what I tried:

List<String> songs = new ArrayList<>();
for (Element s : doc.select("#core")) {
    System.out.println(s.html());
    songs.add(s.text());
}

for (String chord : songs) {
    System.out.println(chord);
}

#core is a <pre> tag. In this <pre> tag, I have a div like the following:

Intro: G - Em - C - D
G
Would you dance,
Em
If I asked you to dance?
C
Would you run,
D
And never look back?
G
Would you cry,
Em
If you saw me crying?
C        D     G
Would you save my soul tonight?



    
        G
        D
        C I can be your hero baby
        G
        D
        C I can kiss away the pain
        G
        D
        C I will stand by you forever
        G
        D
        C You can take my breath away
    
 


When I scrape this, Jsoup doesn't maintain the correct formatting inside the div. Is there a way to get the <pre> tag content as it is?

Alkis Kalogeris · Accepted Answer

If you just want to scrape the content without parsing it, you can do something like this:

Connection.Response response = Jsoup.connect("URL_HERE").execute();
System.out.println(response.body()); // Prints the body exactly as the server sent it, formatting included.

If you want to parse the content after that, then do this:

Document doc = response.parse();

If you want to remove an element, you have to parse the content. But once you parse it, the formatting that was there is generally lost.

A workaround would be to escape the element whose whitespace you want to keep. See this answer from the author of Jsoup: https://stackoverflow.com/a/5830454/1138559. You would also have to escape the contents of the <pre>, though, since it contains HTML elements too.
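If going through the parser is what costs you the layout, another option is to edit the raw string from response.body() directly. The following is only a minimal sketch and not a Jsoup feature: the class name DivStripper, the method stripDivTags, and the sample markup are all made up for illustration. It assumes the div tags are simple (no ">" inside attribute values). A regex like this is fragile on arbitrary HTML, so treat it as a workaround for a page whose layout you already know.

```java
import java.util.regex.Pattern;

public class DivStripper {
    // Matches an opening <div ...> or a closing </div>, case-insensitively.
    // The \b stops it from matching tag names that merely start with "div".
    private static final Pattern DIV_TAG =
            Pattern.compile("</?div\\b[^>]*>", Pattern.CASE_INSENSITIVE);

    /** Removes div tags only; text and whitespace between them are left untouched. */
    public static String stripDivTags(String rawHtml) {
        return DIV_TAG.matcher(rawHtml).replaceAll("");
    }

    public static void main(String[] args) {
        // Stand-in for a fragment cut out of response.body(); this markup is hypothetical.
        String raw = "<pre id=\"core\">G\n"
                + "Would you dance,\n"
                + "<div class=\"chorus\">C        D     G\n"
                + "I can be your hero baby</div>\n"
                + "</pre>";
        System.out.println(stripDivTags(raw));
    }
}
```

Since nothing is re-serialized, the spacing between chords and the line breaks come out exactly as they were in the input string. If you also need to cut the fragment out of the full page first, you would still have to locate it yourself, for example by searching for the opening tag of #core in the body.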
