Reputation: 519
I have quite a large number of files (around 600) containing text I scraped with Jsoup. The text contains only HTML within <p> and <br> tags, to try and preserve some form of paragraph structure in the text. The problem is that in some files there are long sequences of newlines, which Java reads as character 10. In some cases there are 30 or more in a row, as if someone had pressed Enter with the key stuck.
I know that it is mostly my fault that the line breaks are there, because of the <br> tags, but I couldn't find a way to preserve only one line break and ditch the rest whilst scraping.
This is the part of the Jsoup code I am using (taken from "How do I preserve line breaks when using jsoup to convert html to plain text?"):
Document document = Jsoup.connect(url).get();
document.outputSettings(new Document.OutputSettings().prettyPrint(false));//preserve html linebreaks
document.select("br").append("\\n");
document.select("p").prepend("\\n\\n");
document.select(":containsOwn(\u00a0)").remove();
String s = document.html().replaceAll("\\\\n", "\n");
String txtOnly = Jsoup.clean(s, "", Whitelist.none(), new Document.OutputSettings().prettyPrint(false));
Would it be possible to clean the contents of the files without actually re-running the scraping process? I tried using a HashSet so that only one character 10 is kept, and then printing that single character 10 when the end of the line is reached, but that did not work.
Any good pointers as to how to go about doing this please?
Upvotes: 0
Views: 299
Reputation: 8889
In HTML, any sequence of one or more whitespace characters (including newlines such as your character 10s) is rendered as a single space. You could use a regular expression to replace each run of whitespace characters with a single space, then do your <br> replacement to insert newlines in the appropriate places.
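public static String processHtml(String html) {
    // Collapse every run of whitespace (including character 10) into one space
    html = normalizeHtmlWhitespace(html);
    // Turn each <br> into a single newline
    // more robust code would use a real HTML parser to do the <br> replacement
    html = html.replace("<br>", "\n");
    return html;
}

public static String normalizeHtmlWhitespace(String html) {
    // \s+ matches one or more whitespace characters, newlines included
    return html.replaceAll("\\s+", " ");
}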
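
If the files on disk are already plain text (the <br> tags are gone and only the character 10s remain), the same regular-expression idea can be applied to the newline runs directly, without re-running the scraper. The following is only a sketch under some assumptions: the files are UTF-8, they end in .txt, and they all sit in one directory. The class name NewlineCleaner and the method names are made up for illustration.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

final class NewlineCleaner {

    // Replace any run of whitespace that contains at least one line break
    // with a single '\n'. Spaces and tabs around the break are dropped too,
    // and "\r\n" is normalized to "\n". Whitespace without a line break
    // (e.g. the space between two words) is left alone.
    static String collapseLineBreaks(String text) {
        return text.replaceAll("\\s*\\n\\s*", "\n");
    }

    // Rewrite every *.txt file in dir in place (assumed extension and encoding).
    static void cleanDirectory(Path dir) throws IOException {
        try (DirectoryStream<Path> files = Files.newDirectoryStream(dir, "*.txt")) {
            for (Path file : files) {
                String original = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
                String cleaned = collapseLineBreaks(original);
                // Only touch files that actually change
                if (!cleaned.equals(original)) {
                    Files.write(file, cleaned.getBytes(StandardCharsets.UTF_8));
                }
            }
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java NewlineCleaner <directory>");
            return;
        }
        cleanDirectory(Paths.get(args[0]));
    }
}
```

Since this overwrites the files, it is probably worth trying it on a copy of the directory first. If you would rather keep one blank line between paragraphs, replacing with "\n\n" instead of "\n" should do that.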
Upvotes: 1