Reputation: 13686
I am trying to load valid HTML for processing in Scala. Converting it to XML seems like a good starting point. There is some nice-looking code for doing that in the somewhat controversial scala.xml.Xhtml object of the Scala core library. Basically, it should entail 'fixing up' tags that are valid in HTML but not in XML (and which therefore prevent the document from being valid XHTML), and just a bit more. Here is the code from there:
def toXhtml(
  x: Node,
  pscope: NamespaceBinding = TopScope,
  sb: StringBuilder = new StringBuilder,
  stripComments: Boolean = false,
  decodeEntities: Boolean = false,
  preserveWhitespace: Boolean = false,
  minimizeTags: Boolean = true): Unit =
{
  def decode(er: EntityRef) = XhtmlEntities.entMap.get(er.entityName) match {
    case Some(chr) if chr.toInt >= 128 => sb.append(chr)
    case _ => er.buildString(sb)
  }
  def shortForm =
    minimizeTags &&
    (x.child == null || x.child.length == 0) &&
    (minimizableElements contains x.label)
  x match {
    case c: Comment => if (!stripComments) c buildString sb
    case er: EntityRef if decodeEntities => decode(er)
    case x: SpecialNode => x buildString sb
    case g: Group =>
      g.nodes foreach { toXhtml(_, x.scope, sb, stripComments, decodeEntities, preserveWhitespace, minimizeTags) }
    case _ =>
      sb.append('<')
      x.nameToString(sb)
      if (x.attributes ne null) x.attributes.buildString(sb)
      x.scope.buildString(sb, pscope)
      if (shortForm) sb.append(" />")
      else {
        sb.append('>')
        sequenceToXML(x.child, x.scope, sb, stripComments, decodeEntities, preserveWhitespace, minimizeTags)
        sb.append("</")
        x.nameToString(sb)
        sb.append('>')
      }
  }
}
What seems to take excessive perseverance is finding out how to use that function on an existing HTML document that has been fetched with scala.io.Source.fromFile. The meaning of the Node type seems a bit elusive in the code base, or rather, I am unsure how to get from the string returned by scala.io.Source.fromFile to something that can be fed into the toXhtml function copied above.
The scaladoc for this function doesn't seem to clarify much.
There's also another related library where the scaladoc only has a zillion entries in it.
I'd be very happy if anyone could explain how a raw HTML string can be converted to 'clean' XHTML using this library, and walk through how to deduce that from the source code, as my Scala is apparently not that good.
Upvotes: 0
Views: 557
Reputation: 1041
You might consider using jsoup for this since it excels at dealing with messy, real-world HTML. It can also scrub HTML based on a whitelist of allowed tags. An example:
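import org.jsoup.Jsoup
import org.jsoup.safety.Whitelist
import scala.collection.JavaConversions._
import scala.io.Source

object JsoupExample extends App {
  // Fetch the raw page as a single string
  val suspectHtml = Source.fromURL("http://en.wikipedia.org/wiki/Scala_(programming_language)").mkString
  // Keep only the tags and attributes permitted by the basic whitelist
  val cleanHtml = Jsoup.clean(suspectHtml, Whitelist.basic)
  // Parse the cleaned markup into a jsoup Document
  val doc = Jsoup.parse(cleanHtml)
  // Print the text of every paragraph (JavaConversions lets foreach work on the Java collection)
  doc.select("p").foreach(node => println(node.text))
}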
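If you still want to go through scala.xml afterwards, here is a minimal sketch of getting from a string to a Node and into toXhtml. It assumes the markup is already well-formed XML (for example after a cleaning step), because XML.loadString is a strict XML parser and will throw on typical tag-soup HTML. The object name XhtmlSketch and the helper toXhtmlString are made up for illustration; XML.loadString and the single-argument Xhtml.toXhtml overload that returns a String come from the scala-xml module, which needs to be on the classpath.

```scala
import scala.xml.{Node, XML, Xhtml}

object XhtmlSketch {
  // Hypothetical helper: parse an already well-formed markup string into a
  // scala.xml Node, then render it as XHTML text.
  def toXhtmlString(wellFormed: String): String = {
    // loadString returns an Elem, which is a subtype of Node
    val node: Node = XML.loadString(wellFormed)
    // Uses the default settings of the function quoted in the question
    // (minimizeTags = true), so empty "minimizable" elements such as br
    // are written in short form and other empty elements keep a closing tag
    Xhtml.toXhtml(node)
  }
}
```

With the defaults, an empty br element should come out in short form while an empty span keeps its explicit closing tag, which is the 'fixing up' the question describes. Getting from raw, non-well-formed HTML to that starting point is the part scala.xml does not do for you.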
Upvotes: 2