Transforming html into xhtml (valid xml) in Scala

Question

I am trying to load valid HTML for processing in Scala, and converting it to XML seems like a good starting point. The somewhat controversial scala.xml.Xhtml object in the Scala core library appears to have very nice code for doing that. Basically, it should entail fixing up tags that are valid in HTML but not in XML (and so prevent the document from being valid XHTML), plus a bit more. Here is the code from there:

def toXhtml(
    x: Node,
    pscope: NamespaceBinding = TopScope,
    sb: StringBuilder = new StringBuilder,
    stripComments: Boolean = false,
    decodeEntities: Boolean = false,
    preserveWhitespace: Boolean = false,
    minimizeTags: Boolean = true): Unit =
  {
    def decode(er: EntityRef) = XhtmlEntities.entMap.get(er.entityName) match {
      case Some(chr) if chr.toInt >= 128  => sb.append(chr)
      case _                              => er.buildString(sb)
    }
    def shortForm =
      minimizeTags &&
      (x.child == null || x.child.length == 0) &&
      (minimizableElements contains x.label)

    x match {
      case c: Comment                       => if (!stripComments) c buildString sb
      case er: EntityRef if decodeEntities  => decode(er)
      case x: SpecialNode                   => x buildString sb
      case g: Group                         =>
        g.nodes foreach { toXhtml(_, x.scope, sb, stripComments, decodeEntities, preserveWhitespace, minimizeTags) }

      case _  =>
        sb.append('<')
        x.nameToString(sb)
        if (x.attributes ne null) x.attributes.buildString(sb)
        x.scope.buildString(sb, pscope)

        if (shortForm) sb.append(" />")
        else {
          sb.append('>')
          sequenceToXML(x.child, x.scope, sb, stripComments, decodeEntities, preserveWhitespace, minimizeTags)
          sb.append("</")
          x.nameToString(sb)
          sb.append('>')
        }
    }
  }

What seems to take excessive perseverance is finding out how to use that function on an existing HTML document that has been fetched with scala.io.Source.fromFile. The meaning of the Node type seems a bit elusive in the code base; at least, I am unsure how to get from the string returned by scala.io.Source.fromFile to something that can be fed into the toXhtml function copied above.

The scaladoc for this function doesn't seem to clarify much.

There's also another related library where the scaladoc only has a zillion entries in it.

I'd be very happy if anyone could explain how a raw HTML string can be converted to 'clean' XHTML using this library, and walk through how to deduce that from the source code, since my Scala is evidently not that good.
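To make the missing step concrete, here is a minimal sketch of going from a string to a Node and then to toXhtml. It assumes the scala-xml module is on the classpath and that the input is already well-formed XML, which typical HTML often is not. The object name HtmlToXhtmlSketch and its helper methods are made up for illustration. XML.loadString returns an Elem, which is a subclass of Node, so its result can be passed to the one-argument Xhtml.toXhtml overload.

```scala
import scala.util.Try
import scala.xml.{Node, XML, Xhtml}

// Hypothetical helper object, for illustration only.
object HtmlToXhtmlSketch {
  // Parse a string that is ALREADY well-formed XML into a scala.xml tree.
  // XML.loadString returns an Elem, which is a subclass of Node.
  def parse(markup: String): Node = XML.loadString(markup)

  // Serialize the tree with the XHTML rules from the snippet above,
  // using the default arguments (minimizeTags = true, and so on).
  def toXhtmlString(markup: String): String = Xhtml.toXhtml(parse(markup))

  def main(args: Array[String]): Unit = {
    // For a file, the string could come from scala.io.Source.fromFile(path).mkString.
    // An empty <br> is in the minimizable list, so it is written in short form.
    println(toXhtmlString("<p>a<br></br>b</p>")) // <p>a<br />b</p>

    // HTML with an unclosed <br> is not well-formed XML, so the XML parser
    // rejects it before toXhtml is ever reached.
    println(Try(parse("<p>a<br>b</p>")).isFailure) // true
  }
}
```

The second call suggests that toXhtml on its own only re-serializes a tree that already parsed. Turning arbitrary real-world HTML into a Node would need an HTML-tolerant parser in front of it, which this sketch does not cover.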
