Aparna

Reputation: 73

Avoid removal of spaces and newline while parsing HTML using jsoup

I have the sample code below.

String sample = "<html>\n"
        + "<head>\n"
        + "</head>\n"
        + "<body>\n"
        + "This is a sample on              parsing HTML body using jsoup\n"
        + "This is a sample on              parsing HTML body using jsoup\n"
        + "</body>\n"
        + "</html>";

Document doc = Jsoup.parse(sample);
String output = doc.body().text();

I get the output as

This is a sample on parsing HTML body using jsoup This is a sample on parsing HTML body using jsoup

But I want the output as

This is a sample on              parsing HTML body using jsoup
This is a sample on              parsing HTML body using jsoup

How do I parse it so that I get this output? Or is there another way to do this in Java?

Upvotes: 6

Views: 2734

Answers (2)

Markus Fischer

Reputation: 1366

When HTML is rendered, runs of multiple whitespace characters are collapsed into a single space by default. jsoup's text() follows the same convention and returns normalized text, so the superfluous whitespace characters (including the newlines) are dropped from the output.
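
The collapsing described above can be approximated in plain Java to see why the output ends up on a single line. This is only a minimal sketch, assuming every run of whitespace is replaced by one space and the result is trimmed. The class and method names are hypothetical, and this is not jsoup's actual implementation.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of jsoup: approximates whitespace normalization.
final class WhitespaceCollapseSketch {
    // One or more whitespace characters (spaces, tabs, newlines, ...).
    private static final Pattern WHITESPACE_RUN = Pattern.compile("\\s+");

    // Replace each whitespace run with a single space, then trim both ends.
    static String collapse(String text) {
        return WHITESPACE_RUN.matcher(text).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        String body = "\nThis is a sample on              parsing HTML body using jsoup\n"
                + "This is a sample on              parsing HTML body using jsoup\n";
        // Both lines come out joined by a single space, as in the question.
        System.out.println(collapse(body));
    }
}
```

Because the newline between the two lines is itself whitespace, it is folded into a single space along with the padding.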

I don't think you can change how the parser works. You could add a preprocessing step where you replace runs of multiple spaces with non-breaking spaces (&nbsp;), which will not collapse. The side effect, of course, is that those spaces would be, well, non-breaking (which doesn't matter if you really just want to use the rendered text, as in doc.body().text()).
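
The preprocessing step could look roughly like the sketch below. It uses only the standard library and would run on the raw string before it is handed to Jsoup.parse. The class and method names are hypothetical, and the approach is deliberately naive: it also rewrites space runs inside tags or attribute values, and it does nothing about newlines. Some jsoup versions may treat U+00A0 as whitespace in text() as well, so the result is worth checking against the version in use.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of jsoup: protects runs of spaces before parsing.
final class NbspPreprocessSketch {
    // Two or more consecutive ordinary spaces.
    private static final Pattern SPACE_RUN = Pattern.compile(" {2,}");

    // Replace every space in such a run with the &nbsp; entity.
    static String protectSpaceRuns(String html) {
        Matcher m = SPACE_RUN.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            StringBuilder entities = new StringBuilder();
            for (int i = 0; i < m.group().length(); i++) {
                entities.append("&nbsp;");
            }
            m.appendReplacement(out, Matcher.quoteReplacement(entities.toString()));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // Single spaces are left alone; only the padded gap is rewritten.
        System.out.println(protectSpaceRuns("sample on     parsing HTML"));
    }
}
```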

Upvotes: 0

Benjamin P.

Reputation: 453

You can disable pretty printing on the document to get the output you want, but you also have to change .text() to .html().

Document doc = Jsoup.parse(sample);
doc.outputSettings(new Document.OutputSettings().prettyPrint(false));
String output = doc.body().html();

Upvotes: 10
