Reputation: 2646
I am trying to scrape some content from a website using Jsoup. This is what I tried:
List<String> songs = new ArrayList<String>();
for (Element s : doc.select("#core")) {
    System.out.println(s.html());
    songs.add(s.text());
}
for (String chord : songs) {
    System.out.println(chord);
}
#core is a <pre> tag. Inside this <pre> tag, I have a div, like the following:
Intro: <u>G</u> - <u>Em</u> - <u>C</u> - <u>D</u>
<u>G</u>
Would you dance,
<u>Em</u>
If I asked you to dance?
<u>C</u>
Would you run,
<u>D</u>
And never look back?
<u>G</u>
Would you cry,
<u>Em</u>
If you saw me crying?
<u>C</u> <u>D</u> <u>G</u>
Would you save my soul tonight?
<div id="part1">
<div class="inner">
<u>G</u>
<u>D</u>
<u>C</u> I can be your hero baby
<u>G</u>
<u>D</u>
<u>C</u> I can kiss away the pain
<u>G</u>
<u>D</u>
<u>C</u> I will stand by you forever
<u>G</u>
<u>D</u>
<u>C</u> You can take my breath away
</div>
</div>
When I scrape this, Jsoup doesn't maintain the correct formatting inside the div. Is there a way to get the <pre> tag content as it is?
Upvotes: 0
Views: 485
Reputation: 17745
If you just want to scrape the content without parsing it, you can do something like this:
Connection.Response response = Jsoup.connect("URL_HERE").execute();
System.out.println(response.body()); //This will keep the format as it is from the server.
If you want to parse the content after that, call:
response.parse();
If you want to remove some elements, you have to parse the content. But once you parse it, any formatting that was there will be lost.
A workaround would be to escape the element whose whitespace you want to keep. See this answer from the author of Jsoup: https://stackoverflow.com/a/5830454/1138559
Note, though, that you have to escape the contents of the <pre> tag, since it contains HTML elements too.
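As a rough illustration of the escaping idea, here is a minimal sketch that works on the raw string from response.body() before any parsing. It uses only the standard library. The class name PreEscaper and its methods are hypothetical helpers, not part of Jsoup, and the regex assumes the element is written literally as <pre ... id="core" ...> with no nested <pre> inside it. The intent is that a parser would then see the block as plain text rather than nested elements, so the markup and whitespace inside it stay as the server sent them.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not a Jsoup API): escapes the markup inside <pre id="core">
// in a raw HTML string, so a later parse treats it as plain text.
class PreEscaper {
    // Group 1: opening tag, group 2: inner content, group 3: closing tag.
    // DOTALL lets '.' match newlines; the match stops at the first </pre>.
    private static final Pattern CORE_PRE = Pattern.compile(
            "(<pre\\b[^>]*\\bid=\"core\"[^>]*>)(.*?)(</pre>)",
            Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    // Escape the three characters that would otherwise be read as markup.
    // '&' must be replaced first so the entities added below are not re-escaped.
    public static String escapeHtml(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;");
    }

    // Return the raw HTML with only the inner content of <pre id="core"> escaped.
    public static String escapeCorePre(String rawHtml) {
        Matcher m = CORE_PRE.matcher(rawHtml);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String replaced = m.group(1) + escapeHtml(m.group(2)) + m.group(3);
            // quoteReplacement keeps '$' and '\' in the content from being
            // interpreted as group references.
            m.appendReplacement(out, Matcher.quoteReplacement(replaced));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // Sample input modelled on the question; a real page would come from response.body().
        String raw = "<pre id=\"core\">Intro: <u>G</u> - <u>Em</u>\n"
                + "<div id=\"part1\">\n"
                + "      <u>C</u>   I can be your hero baby\n"
                + "</div></pre>";
        System.out.println(escapeCorePre(raw));
    }
}
```

The escaped string could then be handed to Jsoup.parse(...) and the #core element read back as text. Keep in mind that the text will contain the literal <u> and <div> tags, so they would still need to be stripped separately if only chords and lyrics are wanted.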
Upvotes: 1