Reputation: 566
Task: an HTML parser in Scala. I'm pretty new to Scala.
So far: I have written a small parser in Scala to parse an arbitrary HTML document.
import scala.xml.Elem
import scala.xml.Node
import scala.collection.mutable.Queue
import scala.xml.Text
import scala.xml.PrettyPrinter

object Reader {
  def loadXML = {
    val parserFactory = new org.ccil.cowan.tagsoup.jaxp.SAXFactoryImpl
    val parser = parserFactory.newSAXParser()
    val source = new org.xml.sax.InputSource("http://www.randomurl.com")
    val adapter = new scala.xml.parsing.NoBindingFactoryAdapter
    val feed = adapter.loadXML(source, parser)
    feed
  }

  def proc(node: Node): String =
    node match {
      case <body>{ txt }</body> => "Partial content: " + txt
      case _ => "grmpf"
    }

  def main(args: Array[String]): Unit = {
    val content = Reader.loadXML
    Console.println(content)
    Console.println(proc(content))
  }
}
The problem is that proc does not work. Basically, I would like to get exactly the content of one node. Is there another way to achieve that without pattern matching?
Does feed in the loadXML function give me back the right format for parsing, or is there a better way to achieve that? feed gives me back the root node, right?
Thanks in advance
Upvotes: 3
Views: 5535
Reputation: 139038
You're right: adapter.loadXML(source, parser) gives you the root node. The problem is that the root node probably isn't going to match the body case in your proc method. Even if the root node were body, it still wouldn't match unless the element contained nothing but text.
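To see the mismatch in isolation, here is a small sketch that runs the same pattern against hand-written XML literals instead of a downloaded page. The object name PatternDemo and the three sample values are made up for illustration, and it assumes the scala-xml module is on the classpath. The pattern binds a single child node, so it fails both when the root is not body and when body has more than one child.

```scala
import scala.xml.Node

object PatternDemo {
  // Same pattern as in the question: it only matches an element
  // labelled "body" that has exactly one child node.
  def proc(node: Node): String = node match {
    case <body>{ txt }</body> => "Partial content: " + txt
    case _ => "grmpf"
  }

  // A body element whose only child is a text node: the pattern matches.
  val textOnly: Node = <body>hello</body>

  // The root is html, not body, so the pattern does not match.
  val wrapped: Node = <html><body>hello</body></html>

  // The root is body, but it has two child nodes, so the pattern does not match.
  val mixed: Node = <body><p>a</p><p>b</p></body>
}
```

Calling PatternDemo.proc on wrapped or mixed falls through to the default case, which is most likely what happens with the root node of a real HTML page.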
You probably want something more like this:
def proc(node: Node): String = (node \\ "body").text
Here, \\ is a selector method that's roughly equivalent to XPath's // (i.e., it returns all the descendants of node named body). If you know that body is a child (as opposed to a deeper descendant) of the root node, which is probably the case for HTML, you can use \ instead of \\.
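A quick way to compare the two selectors is to try them on a small literal document. This is only a sketch: SelectorDemo, doc, and the helper names are hypothetical, the markup stands in for whatever TagSoup returns for a real page, and scala-xml is assumed to be on the classpath.

```scala
import scala.xml.Node

object SelectorDemo {
  // Hypothetical stand-in for the root node returned by loadXML.
  val doc: Node =
    <html><head><title>Demo</title></head><body><p>Hello, <b>world</b></p></body></html>

  // \\ searches the node and all of its descendants for elements named "body".
  def bodyTextDeep(node: Node): String = (node \\ "body").text

  // \ only looks at the direct children of the node.
  def bodyTextChild(node: Node): String = (node \ "body").text

  // p is a grandchild of html, so \ finds nothing here while \\ does.
  def paragraphsChild(node: Node): String = (node \ "p").text
  def paragraphsDeep(node: Node): String = (node \\ "p").text
}
```

Because body is a direct child of html in this document, both body selectors return the same text; the difference only shows up for deeper elements such as p.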
Upvotes: 3