Praveen Thirukonda

Reputation: 375

Which is better for processing a huge file in Java - XML or a serialized file?

I have a huge file (3GB+) in XML. Currently, I read the XML in my Java code, parse it, and store it in a HashMap, which is then used as a lookup. This process is done about 1000 times in 1000 different JVMs for each run of this code, because the 1000 JVMs operate on 1000 partitions of the input data.

I was wondering whether, as a one-time activity, I could serialize the HashMap and store the output. Then each Java program would just deserialize the HashMap, avoiding parsing the XML file 1000 times.

Will this speed up the code a lot, or will the serialization overhead nullify any gains?
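What I have in mind is roughly this (the class name, the String key/value types, and the file handling are placeholders for my actual data):

```java
import java.io.*;
import java.util.HashMap;

public class LookupCache {
    // One-time step: after parsing the XML, persist the resulting map.
    static void save(HashMap<String, String> map, File file) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(
                new BufferedOutputStream(new FileOutputStream(file)))) {
            out.writeObject(map);
        }
    }

    // Per-JVM step: load the map without touching the XML parser at all.
    @SuppressWarnings("unchecked")
    static HashMap<String, String> load(File file) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(
                new BufferedInputStream(new FileInputStream(file)))) {
            return (HashMap<String, String>) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        HashMap<String, String> map = new HashMap<>();
        map.put("key1", "value1");
        File file = File.createTempFile("lookup", ".ser");
        file.deleteOnExit();
        save(map, file);
        HashMap<String, String> restored = load(file);
        System.out.println(restored.get("key1")); // prints "value1"
    }
}
```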

EDIT: 1. The 1000 different JVMs operate on 1000 partitions of the input data, hence this process has to occur 1000 times.

Upvotes: 1

Views: 530

Answers (4)

Michael Kay

Reputation: 163342

I would say from my experience that the best format for serializing XML is as XML. The XML representation will generally be smaller than the output of Java serialization and therefore faster to load. But try it and see.

What isn't clear to me is why you need to serialize the partitions at all, unless your processing is highly distributed (e.g. on a cluster without shared memory).

With Saxon-EE you can do the processing like this:

<xsl:template name="main">
  <xsl:stream href="big-input.xml">
    <xsl:for-each select="/*/partition" saxon:threads="50">
      <xsl:sequence select="f:process-one-partition(copy-of(.))"/>
    </xsl:for-each>
  </xsl:stream>
</xsl:template>

The function f:process-one-partition can be written either in Java or in XSLT.

The memory needed for this will be of the order of number-of-threads * size-of-one-partition.

Upvotes: 0

Peter Lawrey

Reputation: 533530

You might consider using Chronicle Map. It can be loaded once into off-heap memory and shared across multiple JVMs without having to be deserialized, i.e. it uses very little heap, and you only read the entries you access via map.get(key).

It works by memory-mapping the file, so you don't pay the price of loading it multiple times: once the first program brings it into memory, it can stay in memory even when no program is using it.
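Chronicle Map's own API is richer than this, but the mechanism it builds on is plain java.nio memory mapping. A minimal stdlib sketch of that mechanism (the file contents and names here are just for illustration):

```java
import java.io.File;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

public class MappedLookup {
    // Map the whole file read-only and copy its bytes out as a String.
    // The mapping is backed by the OS page cache, so a second JVM mapping
    // the same file reads pages that are already in memory.
    static String readAll(File file) throws Exception {
        try (RandomAccessFile raf = new RandomAccessFile(file, "r");
             FileChannel channel = raf.getChannel()) {
            MappedByteBuffer buf = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
            byte[] bytes = new byte[buf.remaining()];
            buf.get(bytes);
            return new String(bytes, StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws Exception {
        File file = File.createTempFile("data", ".bin");
        file.deleteOnExit();
        Files.write(file.toPath(), "hello".getBytes(StandardCharsets.UTF_8));
        System.out.println(readAll(file)); // prints "hello"
    }
}
```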

Disclaimer: I helped write it.

Upvotes: 1

Tim B

Reputation: 41188

It is likely that the serialized file will be faster, but there are no guarantees. The only way to be sure is to try it on your machine and benchmark it to measure the difference. Just be aware of the issues, like JIT warmup, that you need to handle to get a good benchmark result.
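A bare-bones sketch of such a benchmark (workload() here is a stand-in for your XML parse or deserialization; for serious numbers a harness such as JMH handles warmup and measurement properly):

```java
public class LoadBenchmark {
    // Stand-in for the code path under test (XML parse vs. deserialization).
    static long workload() {
        long sum = 0;
        for (int i = 0; i < 1_000_000; i++) sum += i;
        return sum;
    }

    public static void main(String[] args) {
        // Warmup: let the JIT compile the hot path before timing it.
        for (int i = 0; i < 10; i++) workload();

        long start = System.nanoTime();
        long result = workload();
        long elapsed = System.nanoTime() - start;
        System.out.println("result=" + result + " elapsed-ns=" + elapsed);
    }
}
```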

The best way to get good performance is to read the file once and keep it in memory. There are overheads to doing that, but if you are calling it often enough it would be worthwhile. You should also think about using a database for something like this; you could always use a lightweight database running locally.

Upvotes: 0

SamC

Reputation: 31

Why are you loading and parsing the same map 1000 times? If nothing else, you could just make a copy of the first one you load to avoid reading another 3GB+ from disk.

Upvotes: 0
