Reputation: 1808
As described in the title, I'm benchmarking XML parsers in Java to compare them. For now, I'm designing the XML documents used to run the benchmark. I'm thinking of increasing the complexity of the documents by increasing the number of elements, the nesting depth, the number of attributes, and the amount of plain text.
However, I would like to have a single set of test data (rather than several different sets, which would take time to prepare). I'm also thinking of pushing the parsers to their limit, until an OutOfMemoryError is thrown.
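To make this concrete, here is the kind of generator I have in mind, with knobs for element count, nesting depth and attributes per element (just a sketch; the class name, element names and sizes are placeholders I made up):

    import java.io.FileOutputStream;
    import java.io.OutputStream;
    import javax.xml.stream.XMLOutputFactory;
    import javax.xml.stream.XMLStreamWriter;

    // Sketch of a test-document generator: the size and shape of the output
    // are controlled by three knobs -- element count, nesting depth and
    // attributes per element.
    public class TestXmlGenerator {

        public static void generate(String path, int elements, int depth,
                                    int attributes) throws Exception {
            try (OutputStream out = new FileOutputStream(path)) {
                XMLStreamWriter w = XMLOutputFactory.newFactory()
                        .createXMLStreamWriter(out, "UTF-8");
                w.writeStartDocument("UTF-8", "1.0");
                w.writeStartElement("root");
                for (int i = 0; i < elements; i++) {
                    // Open a chain of nested elements, each with some attributes.
                    for (int d = 0; d < depth; d++) {
                        w.writeStartElement("level" + d);
                        for (int a = 0; a < attributes; a++) {
                            w.writeAttribute("attr" + a, "value" + a);
                        }
                    }
                    w.writeCharacters("some plain text payload " + i);
                    // Close the chain again.
                    for (int d = 0; d < depth; d++) {
                        w.writeEndElement();
                    }
                }
                w.writeEndElement();
                w.writeEndDocument();
                w.close();
            }
        }

        public static void main(String[] args) throws Exception {
            // e.g. 10,000 groups, 5 levels deep, 3 attributes per element
            generate("test.xml", 10000, 5, 3);
        }
    }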
Has anyone benchmarked XML parsers before? Any advice on test data design would help a lot.
Upvotes: 1
Views: 79
Reputation: 163458
The best set of XML data for benchmarking is the set that most closely reflects the real workload.
Different users have different requirements. Some are interested in parsing a small number of very large documents, some in parsing a large number of very small documents. Some will do validation (using DTD or schema), others won't. Some will have very dense markup, some very sparse. Some will be primarily English-language (ASCII), others will use Asian languages.
I have to ask why you are doing this. The difference between the slowest and the fastest is unlikely to be more than 20%. Is that difference critical to your business? Will choosing the fastest save you enough money to finance the benchmarking exercise? Might it be cheaper to buy some extra hardware (or cloud resources)?
My other observation is that there is a high risk of putting in a lot of effort and then getting the wrong answer. I've seen no end of published performance figures where elementary mistakes were made in the measurement methodology.
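If you do go ahead, avoid hand-rolled timing loops and use an established harness such as JMH, which takes care of warm-up, forking and dead-code elimination for you. A minimal sketch, assuming an in-memory document and a DOM parse as the operation being measured (the class name, test document and iteration settings are illustrative only):

    import java.io.ByteArrayInputStream;
    import java.nio.charset.StandardCharsets;
    import java.util.concurrent.TimeUnit;
    import javax.xml.parsers.DocumentBuilderFactory;

    import org.openjdk.jmh.annotations.*;
    import org.openjdk.jmh.infra.Blackhole;

    @BenchmarkMode(Mode.AverageTime)
    @OutputTimeUnit(TimeUnit.MICROSECONDS)
    @State(Scope.Benchmark)
    @Fork(2)
    @Warmup(iterations = 5)
    @Measurement(iterations = 10)
    public class DomParseBenchmark {

        private byte[] xmlBytes;

        @Setup
        public void setup() {
            // Load or generate the test document once, outside the measured code.
            xmlBytes = "<root><item id=\"1\">text</item></root>"
                    .getBytes(StandardCharsets.UTF_8);
        }

        @Benchmark
        public void parseDom(Blackhole bh) throws Exception {
            // Consuming the result stops the JIT from eliminating the parse.
            bh.consume(DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xmlBytes)));
        }
    }

Run it from a project that depends on jmh-core and the JMH annotation processor, and compare parsers by swapping the body of the benchmark method rather than changing the harness.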
Upvotes: 2