codebot

Reputation: 2646

Remove <div> tag from HTML in Jsoup

I am trying to scrape some content from a website using Jsoup. This is what I tried:

List<String> songs = new ArrayList<String>();
for (Element s : doc.select("#core")) {
    System.out.println(s.html());
    songs.add(s.text());
}

for (String chord : songs) {
    System.out.println(chord);
}

#core is a <pre> tag. Inside this <pre> tag there is a <div>, like the following:

Intro: <u>G</u> - <u>Em</u> - <u>C</u> - <u>D</u>
<u>G</u>
Would you dance,
<u>Em</u>
If I asked you to dance?
<u>C</u>
Would you run,
<u>D</u>
And never look back?
<u>G</u>
Would you cry,
<u>Em</u>
If you saw me crying?
<u>C</u>        <u>D</u>     <u>G</u>
Would you save my soul tonight?

<div id="part1">

    <div class="inner">
        <u>G</u>
        <u>D</u>
        <u>C</u> I can be your hero baby
        <u>G</u>
        <u>D</u>
        <u>C</u> I can kiss away the pain
        <u>G</u>
        <u>D</u>
        <u>C</u> I will stand by you forever
        <u>G</u>
        <u>D</u>
        <u>C</u> You can take my breath away
    </div>
 </div>

When I scrape this, Jsoup doesn't maintain the original formatting inside the <div>. Is there a way to get the content of the <pre> tag as it is?

Upvotes: 0

Views: 485

Answers (1)

Alkis Kalogeris

Reputation: 17745

If you just want to scrape the content without parsing it, you can do something like this:

Connection.Response response = Jsoup.connect("URL_HERE").execute();
System.out.println(response.body()); //This will keep the format as it is from the server.

If you want to parse the content after that, do this:

response.parse();

If you want to remove an element, you have to parse the content. But once you parse it and read it back with text(), the whitespace formatting that was there is lost, because text() normalizes whitespace.

A workaround would be to escape the element whose whitespace you want to keep. See this answer from the author of Jsoup: https://stackoverflow.com/a/5830454/1138559. Note, though, that you would have to escape the contents of the <pre> tag, since it contains HTML elements too.
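As an alternative to escaping, here is a minimal sketch that drops the <div> wrappers while keeping the whitespace. It is not the approach from the linked answer. It assumes jsoup 1.11.2 or later on the classpath (for selectFirst and wholeText); the class name, the method names and the sample HTML are made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class PreWhitespaceSketch {

    // Hypothetical helper: returns the inner HTML of the first element matching
    // preSelector, with every nested <div> tag removed but its children kept.
    static String preHtmlWithoutDivs(String html, String preSelector) {
        Document doc = Jsoup.parse(html);
        // Turn off pretty-printing so jsoup does not re-indent the output.
        doc.outputSettings().prettyPrint(false);
        Element pre = doc.selectFirst(preSelector);
        if (pre == null) {
            return "";
        }
        // unwrap() removes the <div> tags themselves and moves their children up.
        pre.select("div").unwrap();
        return pre.html();
    }

    // Hypothetical helper: returns the text of the matching element without
    // whitespace normalization (unlike text(), which collapses whitespace).
    static String preWholeText(String html, String preSelector) {
        Element pre = Jsoup.parse(html).selectFirst(preSelector);
        return pre == null ? "" : pre.wholeText();
    }

    public static void main(String[] args) {
        // Shortened stand-in for the <pre id="core"> block in the question.
        String html = "<pre id=\"core\"><u>G</u>\nWould you dance,\n"
                + "<div id=\"part1\"><div class=\"inner\">\n"
                + "    <u>C</u> I can be your hero baby\n"
                + "</div></div></pre>";
        System.out.println(preHtmlWithoutDivs(html, "#core"));
        System.out.println(preWholeText(html, "#core"));
    }
}
```

The first helper keeps the <u> chord tags and the line breaks but drops the <div> tags; the second returns only the text with the original spacing. If you would rather delete the <div> together with its contents, remove() can be used in place of unwrap().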

Upvotes: 1
