Reputation: 1072
I am writing a plugin for a web application that takes user-provided HTML and transforms it into a different piece of HTML. Mostly I want to find all elements with a given class/content ("directives") and rewrite them to something else. I am using Scala 2.11.1 and the TagSoup parser to deal with XML-unfriendly code.
My main problem at the moment is that the call to XML.loadString("<div></div>")
yields:
scala> XML.loadString("<div></div>")
res2: scala.xml.Elem = <div/>
This behaviour garbles the resulting page (i.e. iframes, divs, etc.), as I want to leave these tags unminimized. Is there a way to avoid this behaviour in the loading phase?
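As far as I can tell the collapsing only shows up when the tree is serialized back to a string. A minimal sketch of what I have tried so far, assuming the scala-xml 1.0.x module that ships with Scala 2.11 (Utility.serialize with its MinimizeMode.Never option is the only workaround I have found, and I would still prefer to handle this at load time):

import scala.xml.{MinimizeMode, Utility, XML}

val elem = XML.loadString("<div></div>")

// default serialization collapses the empty element
println(elem.toString)                                               // <div/>
// telling the serializer never to minimize keeps the explicit close tag
println(Utility.serialize(elem, minimizeTags = MinimizeMode.Never))  // <div></div>

I have also seen scala.xml.Xhtml.toXhtml mentioned for HTML-friendly output, but I have not checked whether it helps here.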
The second problem is related to TagSoup. When parsing a block of code like:
<script type="javascript">console.log("Hello");</script>
TagSoup parses it as
<script type="javascript">console.log(&quot;Hello&quot;);</script>
Is there anything that can be done to avoid these problems? So far I have only come up with "nasty" solutions, like rewriting all elements to be unminimized and stripping the entities from the content of <script> tags (sketched below).
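To make the second "nasty" idea concrete, the rewrite I have in mind looks roughly like the sketch below, using scala-xml's RewriteRule and Unparsed; the sample markup and names are only illustrative and I am not convinced this is the right approach:

import scala.xml._
import scala.xml.transform.{RewriteRule, RuleTransformer}

// Wrap the text content of <script> elements in Unparsed nodes, which the
// serializer writes out verbatim instead of entity-escaping.
val unescapeScripts = new RewriteRule {
  override def transform(n: Node): Seq[Node] = n match {
    case e: Elem if e.label == "script" =>
      e.copy(child = e.child.map {
        case Text(code) => Unparsed(code)
        case other      => other
      })
    case other => other
  }
}

val doc         = XML.loadString("""<div><script type="javascript">console.log("Hello");</script></div>""")
val transformer = new RuleTransformer(unescapeScripts)
val fixed       = transformer(doc)
// fixed.toString keeps console.log("Hello"); plain doc.toString re-escapes the quotes to &quot;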
The TagSoup parsing is done like this:
import java.net.URL

import org.ccil.cowan.tagsoup.jaxp.SAXFactoryImpl
import org.xml.sax.InputSource

import scala.xml._
import scala.xml.parsing.NoBindingFactoryAdapter

object HTML {
  // TagSoup's SAX parser feeds a NoBindingFactoryAdapter, which builds a
  // scala.xml tree from HTML that is not well-formed XML.
  lazy val adapter = new NoBindingFactoryAdapter
  lazy val parser  = (new SAXFactoryImpl).newSAXParser()

  def load(source: InputSource) = adapter.loadXML(source, parser)
  def loadString(source: String) = load(Source.fromString(source))
  def loadURL(url: URL) = load(new InputSource(url.openConnection().getInputStream))
}
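For context, I call it roughly like this (the "directive" class and the sample markup are just placeholders):

// Parse a messy, non-XML-friendly fragment and pick out the "directive" divs.
val page = HTML.loadString("""<div class="directive"><p>Rewrite me<iframe src="frame.html"></iframe></div>""")
val directives = (page \\ "div").filter(d => (d \ "@class").text == "directive")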
Upvotes: 1
Views: 324