Reputation: 73
I have the sample code below.
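import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// A Java string literal cannot span lines, so the line breaks are written as \n.
String sample = "<html>\n"
        + "<head>\n"
        + "</head>\n"
        + "<body>\n"
        + "This is a sample on parsing HTML body using jsoup\n"
        + "This is a sample on parsing HTML body using jsoup\n"
        + "</body>\n"
        + "</html>";
Document doc = Jsoup.parse(sample);
String output = doc.body().text();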
I get the output as
This is a sample on parsing HTML body using jsoup This is a sample on parsing HTML body using jsoup
But I want the output as
This is a sample on parsing HTML body using jsoup
This is a sample on parsing HTML body using jsoup
How do I parse it so that I get this output? Or is there another way to do this in Java?
Upvotes: 6
Views: 2734
Reputation: 1366
When HTML is rendered, runs of whitespace (including line breaks) collapse into a single space. jsoup's text() follows the same convention and returns whitespace-normalized text, so the line break between your two lines becomes a single space.
I don't think you can change how the parser works. You could add a preprocessing step that replaces multiple whitespace characters with non-breaking spaces (&nbsp;), which do not collapse. The side effect, of course, is that those spaces would be non-breaking, which doesn't matter if you only want the rendered text, as in doc.body().text().
Upvotes: 0
Reputation: 453
You can disable pretty printing for the document to get the output you want. You also have to change .text() to .html().
Document doc = Jsoup.parse(sample);
doc.outputSettings(new Document.OutputSettings().prettyPrint(false));
String output = doc.body().html();
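For completeness, here is a minimal, self-contained sketch of that approach. The class name BodyTextDemo and the helpers rawBodyHtml and rawBodyText are made up for illustration. Note that html() returns markup, so entities stay escaped and any child tags are included. The second helper assumes a jsoup version that provides Element.wholeText(), which returns the decoded text without whitespace normalization. Both results are trimmed because whitespace around the body content is otherwise kept.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class BodyTextDemo {
    // Hypothetical helper: body markup as written in the source, no pretty printing.
    static String rawBodyHtml(String html) {
        Document doc = Jsoup.parse(html);
        doc.outputSettings(new Document.OutputSettings().prettyPrint(false));
        return doc.body().html().trim();
    }

    // Hypothetical helper: decoded body text with the original line breaks kept.
    // Assumes Element.wholeText() is available in the jsoup version in use.
    static String rawBodyText(String html) {
        return Jsoup.parse(html).body().wholeText().trim();
    }

    public static void main(String[] args) {
        String sample = "<html><head></head><body>first line\nsecond line</body></html>";
        System.out.println(rawBodyHtml(sample));
        System.out.println(rawBodyText(sample));
    }
}
```

If the body can contain tags or characters such as & or <, the wholeText() variant is probably closer to what the question asks for, since it gives plain text rather than markup.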
Upvotes: 10