shayan

Reputation: 1241

Clean HTML using a whitelist while keeping some attributes on the whitelisted elements

The Scala snippet below uses Jsoup to strip every HTML tag from a string except those explicitly listed in the whitelist:

import org.jsoup.Jsoup
import org.jsoup.safety.Whitelist

val whiteList = Whitelist.none().addTags(
  "b", "br", "ul", "ol", "li", "em", "h4", "h5", "hr", "pre", "sub", "sup"
)
Jsoup.clean("some unsafe text", whiteList)

This indiscriminately strips all CSS styling and element attributes from the tags in the text, which is what I want in the general case. However, I would like the process to retain the direction CSS property, or alternatively the dir attribute, on the block elements in the whitelist.

An answer written in Java is fine too.

Upvotes: 1

Views: 262

Answers (1)

shayan

Reputation: 1241

I solved it by passing the unsafe text to a custom recursive method like this:

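import org.jsoup.Jsoup
import org.jsoup.nodes.Element

val whiteList = Set(
  "b", "br", "ul", "ol", "li", "em", "h4", "h5", "hr", "pre", "sub", "sup"
)
def clean(raw: String): String = {
  def traverseAndClean(elem: Element): Unit = {
    if (!whiteList.contains(elem.tagName())) {
      // Note: remove() drops the element together with its text and children,
      // whereas Jsoup.clean keeps the text of disallowed tags.
      elem.remove()
    } else {
      // Iterate over a copy (asList) so attributes can be removed safely;
      // removing from the live Attributes while iterating it is unsafe.
      elem.attributes().asList().forEach { attr =>
        val key = attr.getKey
        if (key != "dir") elem.removeAttr(key)
      }
      // children() returns a new list, so removing a child mid-iteration is fine
      elem.children().forEach(child => traverseAndClean(child))
    }
  }
  val doc = Jsoup.parseBodyFragment(raw)
  doc.body().children().forEach(child => traverseAndClean(child))
  doc.body().html()
}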

clean("my unsafe text")
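
For comparison, Jsoup's whitelist can also keep attributes itself through addAttributes(tag, attributes...), so the dir attribute can probably be preserved without a hand-written traversal. The following is a minimal sketch, assuming a Jsoup version that still ships Whitelist (newer releases call the class Safelist). The object name DirAwareCleaner and the choice of which tags count as "block" elements are illustrative assumptions of mine, not part of the original code. As far as I can tell this does not cover the CSS direction property: allowing the style attribute would let every CSS property through, not just direction.

```scala
import org.jsoup.Jsoup
import org.jsoup.safety.Whitelist // assumption: older Jsoup; newer releases name this Safelist

// Hypothetical helper object, named here only for illustration.
object DirAwareCleaner {
  private val tags =
    Seq("b", "br", "ul", "ol", "li", "em", "h4", "h5", "hr", "pre", "sub", "sup")

  // Assumption: these are the "block" elements that may keep their dir attribute.
  private val blockTags = Seq("ul", "ol", "li", "h4", "h5", "pre")

  private val whiteList: Whitelist = {
    val base = Whitelist.none().addTags(tags: _*)
    // Allow only the dir attribute, and only on the block tags above.
    blockTags.foldLeft(base)((wl, tag) => wl.addAttributes(tag, "dir"))
  }

  // Unlike the recursive version, Jsoup.clean keeps the text of disallowed tags.
  def clean(raw: String): String = Jsoup.clean(raw, whiteList)
}
```

Calling DirAwareCleaner.clean on markup such as a ul carrying dir, style and onclick attributes should leave only dir in place, while dir on an inline tag like b is stripped because b is not in blockTags.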

Upvotes: 1
