Reputation: 3315
I'm using this code to clean messy HTML exported from Word, stripping out essentially everything. I want to keep only text-formatting tags and text alignment.
[...]
String result = null;
Document html = Jsoup.parse(rawHtml, "/");
html.select("span").unwrap();
Whitelist wl = Whitelist.simpleText();
wl.addTags("div", "span", "p");
wl.addAttributes(":all", "align");
html.outputSettings(new Document.OutputSettings().prettyPrint(false));//makes html() preserve linebreaks and spacing
this.editStyle(html, "[align='center']", "style", "text-align: center");
this.editStyle(html, "[align='justify']", "style", "text-align: justify");
result = Jsoup.clean(html.body().html(), wl);
return result;
private void editStyle(Document html, String selector, String attrKey, String attrVal) {
Elements values = html.select(selector);
values.attr(attrKey, attrVal);
}
I know it's redundant to have both the align and the style attribute, but I'm keeping both only for testing purposes; once I fix this I'll remove the align attribute as well.
This of course doesn't keep the style attributes I add whenever I meet an align attribute, since the whitelist doesn't allow style. So what I want to achieve is to use clean to remove everything except style attributes containing exclusively a text-align
value (that is, it should strip any other style attribute, even one that contains text-align plus something else).
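To make the "exclusively a text-align value" rule concrete, here is a minimal sketch of the check in plain Java, independent of Jsoup. The class name `TextAlignStyleCheck`, the method `isTextAlignOnly`, and the accepted values (left, right, center, justify) are assumptions made for illustration; my code above only ever produces center and justify.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Jsoup: decides whether a style attribute
// value consists of exactly one text-align declaration and nothing else.
final class TextAlignStyleCheck {

    // Assumed set of allowed values; adjust to whatever should survive cleaning.
    private static final Pattern TEXT_ALIGN_ONLY = Pattern.compile(
            "\\s*text-align\\s*:\\s*(left|right|center|justify)\\s*;?\\s*",
            Pattern.CASE_INSENSITIVE);

    private TextAlignStyleCheck() {
    }

    // Returns true only when the whole value is a single text-align declaration,
    // optionally followed by one semicolon. Null or mixed styles return false.
    static boolean isTextAlignOnly(String style) {
        return style != null && TEXT_ALIGN_ONLY.matcher(style).matches();
    }
}
```

With this, "text-align: center" passes, while "text-align: center; color: red" is rejected because the pattern has to match the whole value. One way to use it would be to test each element's style value and drop the attribute when the check fails.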
I know that it works if I change the last part like this:
wl.addAttributes(":all", "align");
html.outputSettings(new Document.OutputSettings().prettyPrint(false));//makes html() preserve linebreaks and spacing
result = Jsoup.clean(html.body().html(), wl);
html = Jsoup.parse(result, "/");
this.editStyle(html, "[align='center']", "style", "text-align: center");
this.editStyle(html, "[align='justify']", "style", "text-align: justify");
result = html.html();
I get the raw HTML from clean, parse it again with Jsoup, and add the attributes back by calling editStyle at this point rather than before cleaning.
But I wanted to know if there's some way to do it in a single step.
Upvotes: 1
Views: 981
Reputation: 3315
Since there was no answer to this, I'm guessing it's not possible, so I just parse the document again after cleaning, as in the alternative solution I already posted in the question:
wl.addAttributes(":all", "align");
html.outputSettings(new Document.OutputSettings().prettyPrint(false));//makes html() preserve linebreaks and spacing
result = Jsoup.clean(html.body().html(), wl);
html = Jsoup.parse(result, "/");
this.editStyle(html, "[align='center']", "style", "text-align: center");
this.editStyle(html, "[align='justify']", "style", "text-align: justify");
result = html.html();
Upvotes: 1