Reputation: 42352
I have a big XML file like the following. What would be the fastest way to parse it in Java or Scala? Streaming individual elements is important but not absolutely essential.
All I'm interested in is getting the attribute values from each Result element.
<Response>
<Result att1="1" att2="2" att3="3" att4="4" att5="5"/>
<Result att1="1" att2="2" att3="3" att4="4" att5="5"/>
<Result att1="1" att2="2" att3="3" att4="4" att5="5"/>
<Result att1="1" att2="2" att3="3" att4="4" att5="5"/>
</Response>
Upvotes: 1
Views: 4709
Reputation: 8932
The answer from cmbaxter is technically correct, but it can be improved with the "flatMap that shit" pattern :-)
import scala.io.Source
import scala.xml.pull._
// Make it a "def", because a Source is stateful and is exhausted once it has been read
def xmlsrc = Source.fromString("""<Response>
  <Result att1="1" att2="2" att3="3" att4="4" att5="5"/>
  <Result att1="1" att2="2" att3="3" att4="4" att5="5"/>
  <Result att1="1" att2="2" att3="3" att4="4" att5="5"/>
  <Result att1="1" att2="2" att3="3" att4="4" att5="5"/>
</Response>""")
// Also a "def", because the result is an iterator that can only be traversed once
def xmlEr = new XMLEventReader(xmlsrc)
// flatMap keeps the "outer shape" of the type it operates on, so we are still dealing with an iterator
def attrs = xmlEr.flatMap {
  case e: EvElemStart => e.attrs.map(a => (a.key, a.value))
  case _ => Iterable.empty
}
// Now let's look at what is inside:
attrs.foreach(println _)
// Or just collect all values of "att5"
attrs.collect { case (name, value) if name == "att5" => value }.foreach(println _)
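The reason for using "def" rather than "val" can be seen with a plain iterator, independent of any XML library. This is only a small illustration; the names `once` and `fresh` are made up for the example.

```scala
// A "val" iterator can be traversed only once
val once = Iterator("att1", "att2", "att3")
val firstPass = once.toList   // consumes the iterator
val secondPass = once.toList  // nothing is left for a second pass

// A "def" builds a fresh iterator on every call
def fresh = Iterator("att1", "att2", "att3")
val passA = fresh.toList
val passB = fresh.toList

println(s"val: ${firstPass.size} elements, then ${secondPass.size}")
println(s"def: ${passA.size} elements, then ${passB.size}")
```

The same applies to the Source and the event reader: each call to the "def" reparses the input from the start, which is convenient for experiments but means the work is repeated.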
But this will not be the fastest way. As benchmarks show, the Scala API is quite slow and memory hungry compared to other solutions. Fortunately, there is a faster and less memory-hungry solution, Scales XML:
import scales.utils._
import ScalesUtils._
import scales.xml._
import ScalesXml._
import java.io.StringReader
def xmlsrc = new StringReader("""<Response>
  <Result att1="1" att2="2" att3="3" att4="4" att5="5"/>
  <Result att1="1" att2="2" att3="3" att4="4" att5="5"/>
  <Result att1="1" att2="2" att3="3" att4="4" att5="5"/>
  <Result att1="1" att2="2" att3="3" att4="4" att5="5"/>
</Response>""")
def pull = pullXml(xmlsrc)
def attributes = pull flatMap {
  case Left(elem: Elem) => elem.attributes
  case _ => Nil
} map (attr => (attr.name, attr.value))
attributes.foreach(println _)
Don't forget to close your iterators after you are done with them. Here it is not necessary, because I am working with a StringReader.
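When the input is a file rather than a string, closing can be handled in one place. The following is a minimal sketch using only the Scala standard library; `withSource` is a hypothetical helper name, not part of Scales XML or scala.xml.

```scala
import scala.io.Source

// Hypothetical helper: runs `body` on the source and closes the source
// afterwards, even if `body` throws.
def withSource[A](src: Source)(body: Source => A): A =
  try body(src)
  finally src.close()

// Usage: the source is closed as soon as the line count has been computed.
val lineCount = withSource(Source.fromString("a\nb\nc"))(_.getLines().size)
println(lineCount)
```

Because the iterators are lazy, everything has to be consumed inside the body. An iterator that is returned from it would try to read from a source that is already closed.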
There is also the Anti-XML library, which looks quite good in the benchmarks and seems to have a very nice API. Unfortunately, I could not get it to run with Scala 2.10, so I cannot provide a running example.
With the examples above, you should be able to write a small test application and run your own benchmarks. Looking at the benchmarks quoted above, I guess that Scales XML might solve your problem. But without real measurements, this is only a guess.
Benchmark yourself and perhaps you can post your results.
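A rough starting point for such a test application could look like the sketch below. `bestMillis` is a hypothetical helper, and the workload is only a stand-in. Timings taken this way are crude, because JIT warm-up and garbage collection distort them, so a harness such as JMH is the safer choice for numbers you want to publish.

```scala
// Hypothetical helper: runs `body` the given number of times and returns
// the best wall-clock time of a single run, in milliseconds.
def bestMillis(runs: Int)(body: => Any): Double = {
  require(runs > 0, "runs must be positive")
  val times = (1 to runs).map { _ =>
    val start = System.nanoTime()
    body
    (System.nanoTime() - start) / 1e6
  }
  times.min
}

// Stand-in workload; replace it with a full pass over your parser's iterator,
// for example attrs.foreach(_ => ()) from the examples above.
val ms = bestMillis(5) { (1 to 10000).map(_ * 2).sum }
println(f"best of 5 runs: $ms%.3f ms")
```

Taking the minimum over several runs reduces the influence of warm-up. Make sure the workload really consumes the iterator, otherwise only the cost of constructing it is measured.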
Upvotes: 2
Reputation: 35463
If your file is large and you don't want to load the whole thing into memory (i.e. as a DOM), then one path you could take is the pull parsing route. If you want to do pull parsing in Scala, looking for the "start element" event in order to inspect the attributes, then you could do something like this:
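import scala.io.Source
import java.io.File
import scala.xml.pull.XMLEventReader
import scala.xml.pull.EvElemStart
// pathToXml is the path to your XML file
val src = Source.fromFile(new File(pathToXml))
val reader = new XMLEventReader(src)
reader foreach {
  case EvElemStart(_, _, attrs, _) =>
    // do something with attrs (a scala.xml.MetaData) here
  case _ =>
    // ignore all other events
}
// release the file handle once all events have been processed
src.close()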
Following this approach should ensure that your file is never read into memory all at once, and it should be fast.
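Since the question also mentions Java, the same pull-parsing idea can be sketched with StAX (javax.xml.stream), which ships with the JDK and needs no extra library. This is not the code from the answer above, only an alternative under the assumption that the elements of interest are named Result; `foreachResultAttribute` is a hypothetical helper name.

```scala
import java.io.{Reader, StringReader}
import javax.xml.stream.{XMLInputFactory, XMLStreamConstants}

// Hypothetical helper: calls `f(name, value)` for every attribute of every
// <Result> element, without building a tree or collecting the results.
def foreachResultAttribute(input: Reader)(f: (String, String) => Unit): Unit = {
  val xml = XMLInputFactory.newInstance().createXMLStreamReader(input)
  try {
    while (xml.hasNext) {
      // next() advances to the following event and returns its type
      if (xml.next() == XMLStreamConstants.START_ELEMENT && xml.getLocalName == "Result") {
        for (i <- 0 until xml.getAttributeCount)
          f(xml.getAttributeLocalName(i), xml.getAttributeValue(i))
      }
    }
  } finally xml.close() // closes the parser; the underlying Reader is left to the caller
}

// Usage with a small in-memory sample; for a real file pass a buffered file reader
val sample = """<Response>
  <Result att1="1" att2="2"/>
  <Result att1="3" att2="4"/>
</Response>"""
foreachResultAttribute(new StringReader(sample)) { (name, value) =>
  println(s"$name=$value")
}
```

The callback style keeps memory use flat, because each attribute is handed over as soon as its element is read. Whether this is faster than the other options for your data is something only a benchmark can tell.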
Upvotes: 7