valepu

Reputation: 3315

Keep attributes with certain value when cleaning html with Jsoup

I'm using this code to clean messy HTML exported from Word, stripping essentially everything. I want to keep only text-formatting tags and text alignment.

[...]
String result = null;
Document html = Jsoup.parse(rawHtml, "/");
html.select("span").unwrap();
Whitelist wl = Whitelist.simpleText();
wl.addTags("div", "span", "p");
wl.addAttributes(":all", "align");
html.outputSettings(new Document.OutputSettings().prettyPrint(false));//makes html() preserve linebreaks and spacing
this.editStyle(html, "[align='center']", "style", "text-align: center");
this.editStyle(html, "[align='justify']", "style", "text-align: justify");
result = Jsoup.clean(html.body().html(), wl);
return result;

private void editStyle(Document html, String selector, String attrKey, String attrVal) {
    Elements values = html.select(selector);
    values.attr(attrKey, attrVal);
}

I know it's redundant to have both the align and the style attribute, but I'm keeping both only for testing purposes; once I'm able to fix this I'll remove the align attribute as well.

This of course doesn't keep the style attributes I add whenever I encounter an align attribute, because the whitelist doesn't allow style. What I want to achieve is to use clean to remove everything except style attributes containing exclusively a text-align value (that is, it should strip any other style attribute, even one that contains text-align along with something else).
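To make the "exclusively text-align" rule concrete, here is a minimal sketch of it as a plain Java predicate, independent of Jsoup. The class name TextAlignStyleFilter, the method isTextAlignOnly and the list of accepted alignment values are assumptions made for illustration; they are not part of Jsoup or of the code above.

```java
import java.util.regex.Pattern;

// Hypothetical helper (not part of Jsoup): decides whether a style value
// consists of exactly one text-align declaration and nothing else.
final class TextAlignStyleFilter {

    // One text-align declaration, an optional trailing semicolon, and
    // arbitrary surrounding whitespace. The accepted values are an assumption.
    private static final Pattern TEXT_ALIGN_ONLY = Pattern.compile(
            "\\s*text-align\\s*:\\s*(left|right|center|justify)\\s*;?\\s*",
            Pattern.CASE_INSENSITIVE);

    static boolean isTextAlignOnly(String styleValue) {
        // matches() requires the whole string to fit the pattern, so any
        // additional declaration makes the check fail.
        return styleValue != null && TEXT_ALIGN_ONLY.matcher(styleValue).matches();
    }

    public static void main(String[] args) {
        System.out.println(isTextAlignOnly("text-align: center"));
        System.out.println(isTextAlignOnly("text-align: center; color: red"));
    }
}
```

A check like this could be applied to every element carrying a style attribute, or used inside a Whitelist subclass if the Jsoup version in use allows its attribute check to be overridden; that second option is an assumption to verify against the Jsoup documentation.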

I know that it works if I change the last part like this:

wl.addAttributes(":all", "align");
html.outputSettings(new Document.OutputSettings().prettyPrint(false));//makes html() preserve linebreaks and spacing
result = Jsoup.clean(html.body().html(), wl);
html = Jsoup.parse(result, "/");
this.editStyle(html, "[align='center']", "style", "text-align: center");
this.editStyle(html, "[align='justify']", "style", "text-align: justify");
result = html.html();

I take the raw HTML returned by clean, parse it again with Jsoup, and add the style attributes by calling editStyle at this point rather than before cleaning.

But I wanted to know if there's some way to do it in a single step.

Upvotes: 1

Views: 981

Answers (1)

valepu

Reputation: 3315

Since there was no answer to this, I'm guessing it's not possible, so I just parse the document again after cleaning, as per the alternative solution I already posted in the question:

wl.addAttributes(":all", "align");
html.outputSettings(new Document.OutputSettings().prettyPrint(false));//makes html() preserve linebreaks and spacing
result = Jsoup.clean(html.body().html(), wl);
html = Jsoup.parse(result, "/");
this.editStyle(html, "[align='center']", "style", "text-align: center");
this.editStyle(html, "[align='justify']", "style", "text-align: justify");
result = html.html();

Upvotes: 1
