user2668128

Reputation: 42352

Fastest way to parse flat, attribute-heavy xml in Java or Scala

Given a big XML file like the following, what would be the fastest way to parse it in Java or Scala? Streaming individual elements is important but not absolutely essential.

All I'm interested in is getting the attribute values from each Result element.

<Response>
    <Result att1="1" att2="2" att3="3" att4="4" att5="5"/>
    <Result att1="1" att2="2" att3="3" att4="4" att5="5"/>
    <Result att1="1" att2="2" att3="3" att4="4" att5="5"/>
    <Result att1="1" att2="2" att3="3" att4="4" att5="5"/>
</Response>

Upvotes: 1

Views: 4709

Answers (2)

stefan.schwetschke

Reputation: 8932

Scala XML (can be slow & memory hungry)

The answer from cmbaxter is technically correct, but it can be improved with the "flatMap that shit" pattern :-)

    import scala.io.Source
    import scala.xml.pull._

    // Make it a "def", because the Source is stateful and may be exhausted after it is read
    def xmlsrc = Source.fromString("""<Response>
         |     <Result att1="1" att2="2" att3="3" att4="4" att5="5"/>
         |     <Result att1="1" att2="2" att3="3" att4="4" att5="5"/>
         |     <Result att1="1" att2="2" att3="3" att4="4" att5="5"/>
         |     <Result att1="1" att2="2" att3="3" att4="4" att5="5"/>
         | </Response>""".stripMargin)

    // Also a "def", because the result is an iterator that may be exhausted
    def xmlEr = new XMLEventReader(xmlsrc)

    // flatMap keeps the "outer shape" of the type it operates on, so we are still dealing with an iterator
    def attrs = xmlEr.flatMap {
      case e: EvElemStart => e.attrs.map(a => (a.key, a.value))
      case _ => Iterable.empty
    }

    // Now let's look at what is inside:
    attrs.foreach(println _)

    // Or just collect all values of "att5"
    attrs.collect { case (name, value) if name == "att5" => value }.foreach(println _)

Scales XML (faster & needs less memory)

But this will not be the fastest way. As benchmarks show, the Scala XML API is quite slow and memory hungry compared to other solutions. Fortunately, there is a faster and less memory-hungry alternative:

    import scales.utils._
    import ScalesUtils._
    import scales.xml._
    import ScalesXml._
    import java.io.StringReader

    // A "def" again, because a Reader can only be consumed once
    def xmlsrc = new StringReader("""<Response>
         |     <Result att1="1" att2="2" att3="3" att4="4" att5="5"/>
         |     <Result att1="1" att2="2" att3="3" att4="4" att5="5"/>
         |     <Result att1="1" att2="2" att3="3" att4="4" att5="5"/>
         |     <Result att1="1" att2="2" att3="3" att4="4" att5="5"/>
         | </Response>""".stripMargin)
    def pull = pullXml(xmlsrc)

    // Keep only element start events and turn their attributes into (name, value) pairs
    def attributes = pull flatMap {
      case Left(elem: Elem) => elem.attributes
      case _ => Nil
    } map (attr => (attr.name, attr.value))

    attributes.foreach(println _)

Don't forget to close your iterators after you are done with them. Here it is not necessary, because I am working with a StringReader.

Anti XML

There is also the Anti XML library, which looks quite nice in benchmarks and seems to have a very nice API. Unfortunately I could not get it to run with Scala 2.10, so I cannot provide a running example.

Conclusion

With the examples above, you should be able to write a small test application and run your own benchmarks. Looking at the benchmarks quoted above, I guess that Scales XML might solve your problem. But without real measuring, this is really only a guess.

Run the benchmarks yourself, and perhaps you can post your results.
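
As a starting point for such a test application, here is a rough timing sketch in Java. It is only an illustration under several assumptions: the parser under test is the JDK's built-in StAX reader (`javax.xml.stream`), which is a stand-in and not one of the libraries discussed above; the class name `ParseTiming`, its helper methods, and the document size of 200,000 elements are made up for this example. A single timed run after a short warm-up is a crude measurement, so treat the numbers as indicative at best.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class ParseTiming {

    // Builds a synthetic document shaped like the one in the question.
    static String buildXml(int resultCount) {
        StringBuilder sb = new StringBuilder("<Response>");
        for (int i = 0; i < resultCount; i++) {
            sb.append("<Result att1=\"1\" att2=\"2\" att3=\"3\" att4=\"4\" att5=\"5\"/>");
        }
        return sb.append("</Response>").toString();
    }

    // Parser under test (assumption: JDK StAX). Swap in another parser here to compare.
    static long countAttributes(String xml) throws XMLStreamException {
        XMLStreamReader reader =
                XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
        long count = 0;
        try {
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                    count += reader.getAttributeCount();
                }
            }
        } finally {
            reader.close();
        }
        return count;
    }

    public static void main(String[] args) throws XMLStreamException {
        String xml = buildXml(200_000);

        // Warm-up runs so the JIT has a chance to compile the hot loop.
        for (int i = 0; i < 5; i++) {
            countAttributes(xml);
        }

        long start = System.nanoTime();
        long count = countAttributes(xml);
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println(count + " attributes in " + elapsedMs + " ms");
    }
}
```

To compare libraries, replace the body of `countAttributes` with the parser you want to measure and keep the rest unchanged. For numbers you intend to publish, a dedicated harness such as JMH is likely a better choice than hand-rolled timing.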

Upvotes: 2

cmbaxter

Reputation: 35463

If your file is large and you don't want to load the whole thing into memory (e.g. as a DOM), then one path you could take is the pull parsing route. If you want to do pull parsing in Scala, looking for the "start element" event in order to inspect the attributes, then you could do something like this:

    import scala.io.Source
    import java.io.File
    import scala.xml.pull.XMLEventReader
    import scala.xml.pull.EvElemStart

    val src = Source.fromFile(new File(pathToXml))
    val reader = new XMLEventReader(src)
    reader foreach {
      case EvElemStart(_, _, attrs, _) =>
        //do something here

      case _ =>
    }

Following this approach should ensure that your file is not read into memory all at once, and it should be fast.
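
Since the question also asks about Java, the same pull-parsing idea can be sketched with the StAX API that ships with the JDK (`javax.xml.stream`). This is a minimal sketch, not a tuned implementation: the class name `StaxAttributes` and the helper `forEachResult` are hypothetical names chosen for the example, and it assumes the elements of interest are named `Result` as in the question. Each element's attributes are handed to a callback as soon as the element is read, so the document is never held in memory as a whole.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Consumer;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class StaxAttributes {

    // Hypothetical helper: streams over the input and passes the attributes
    // of every <Result> element to the handler, one element at a time.
    static void forEachResult(Reader in, Consumer<Map<String, String>> handler)
            throws XMLStreamException {
        XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(in);
        try {
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && "Result".equals(reader.getLocalName())) {
                    Map<String, String> attrs = new LinkedHashMap<>();
                    for (int i = 0; i < reader.getAttributeCount(); i++) {
                        attrs.put(reader.getAttributeLocalName(i), reader.getAttributeValue(i));
                    }
                    handler.accept(attrs);
                }
            }
        } finally {
            // Releases parser resources; the underlying Reader must be closed by the caller.
            reader.close();
        }
    }

    public static void main(String[] args) throws XMLStreamException {
        String xml = "<Response>"
                + "<Result att1=\"1\" att2=\"2\" att3=\"3\" att4=\"4\" att5=\"5\"/>"
                + "<Result att1=\"1\" att2=\"2\" att3=\"3\" att4=\"4\" att5=\"5\"/>"
                + "</Response>";
        // Print the attribute map of each Result element as it is encountered.
        forEachResult(new StringReader(xml), attrs -> System.out.println(attrs));
    }
}
```

For a real file you would pass a `FileReader` or an `InputStream`-based reader instead of the `StringReader` and close it yourself once parsing is done. Building a map per element is convenient but not free; if only one or two attributes matter, reading them directly with `getAttributeValue(null, "att5")` avoids the allocation.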

Upvotes: 7
