Reputation: 1072
I am writing a plugin for a web application that takes user-provided HTML and transforms it into a different piece of HTML. Mostly I want to find all elements with a given class/content ("directives") and rewrite them to something else. I am using Scala 2.11.1 and the TagSoup parser to deal with XML-unfriendly code.
My main problem at the moment is that the call to XML.loadString("<div></div>")
yields:
scala> XML.loadString("<div></div>")
res2: scala.xml.Elem = <div/>
This behaviour garbles the resulting page (i.e. iframes, divs, etc.), as I want to leave these tags unminimized. Is there a way to avoid this behaviour in the loading phase?
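As far as I can tell the collapsing only shows up when the tree is serialized back to a string. A minimal sketch of what I have tried so far, assuming the scala-xml 1.0.x module that ships with Scala 2.11 (Utility.serialize with its MinimizeMode.Never option is the only workaround I have found, and I would still prefer to handle this at load time):

import scala.xml.{MinimizeMode, Utility, XML}

val elem = XML.loadString("<div></div>")

// default serialization collapses the empty element
println(elem.toString)                                               // <div/>
// telling the serializer never to minimize keeps the explicit close tag
println(Utility.serialize(elem, minimizeTags = MinimizeMode.Never))  // <div></div>

I have also seen scala.xml.Xhtml.toXhtml mentioned for HTML-friendly output, but I have not checked whether it helps here.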
The second problem is related to TagSoup. When parsing a block of code like:
<script type="javascript">console.log("Hello");</script>
TagSoup parses it as
<script type="javascript">console.log(&quot;Hello&quot;);</script>
Is there anything that can be done to avoid these problems? So far I have only come up with "nasty" solutions, like rewriting all elements to be unminimized and stripping the entities from the content of <script> tags (sketched below).
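To make the second "nasty" idea concrete, the rewrite I have in mind looks roughly like the sketch below, using scala-xml's RewriteRule and Unparsed; the sample markup and names are only illustrative and I am not convinced this is the right approach:

import scala.xml._
import scala.xml.transform.{RewriteRule, RuleTransformer}

// Wrap the text content of <script> elements in Unparsed nodes, which the
// serializer writes out verbatim instead of entity-escaping.
val unescapeScripts = new RewriteRule {
  override def transform(n: Node): Seq[Node] = n match {
    case e: Elem if e.label == "script" =>
      e.copy(child = e.child.map {
        case Text(code) => Unparsed(code)
        case other      => other
      })
    case other => other
  }
}

val doc         = XML.loadString("""<div><script type="javascript">console.log("Hello");</script></div>""")
val transformer = new RuleTransformer(unescapeScripts)
val fixed       = transformer(doc)
// fixed.toString keeps console.log("Hello"); plain doc.toString re-escapes the quotes to &quot;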
The TagSoup parsing is done like this:
import java.net.URL

import org.ccil.cowan.tagsoup.jaxp.SAXFactoryImpl
import org.xml.sax.InputSource

import scala.xml._
import scala.xml.parsing.NoBindingFactoryAdapter

object HTML {
  // TagSoup's SAX parser feeds a NoBindingFactoryAdapter, which builds a
  // scala.xml tree from HTML that is not well-formed XML.
  lazy val adapter = new NoBindingFactoryAdapter
  lazy val parser  = (new SAXFactoryImpl).newSAXParser()

  def load(source: InputSource) = adapter.loadXML(source, parser)
  def loadString(source: String) = load(Source.fromString(source))
  def loadURL(url: URL) = load(new InputSource(url.openConnection().getInputStream))
}
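For context, I call it roughly like this (the "directive" class and the sample markup are just placeholders):

// Parse a messy, non-XML-friendly fragment and pick out the "directive" divs.
val page = HTML.loadString("""<div class="directive"><p>Rewrite me<iframe src="frame.html"></iframe></div>""")
val directives = (page \\ "div").filter(d => (d \ "@class").text == "directive")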
Upvotes: 1
Views: 324