efritz
efritz

Reputation: 5203

Microbenchmarking with Big Data

I'm currently working on my thesis project designing a cache implementation to be used with a shortest-path graph algorithm. The graph algorithm is rather inconsistent with runtimes, so it's too troublesome to benchmark the entire algorithm. I must concentrate on benchmarking the cache only.

The caches I need to benchmark are about a dozen or so implementations of a Map interface. These caches are designed to work well with a given access pattern (the order in which keys are queried from the algorithm above). However, in a given run of a "small" problem, there are a few hundred billion queries. I need to run almost all of them to be confident about the results of the benchmark.

I'm having conceptual problems about loading the data into memory. It's possible to create a query log, which would just be an on-disk ordered list of all the keys (they're 10-character string identifiers) that were queried in one run of the algorithm. This file is huge. The other thought I had would be to break the log up into chunks of 1-5 million queries, and benchmark in the following manner:

  1. Load 1-5 million keys
  2. Set start time to current time
  3. Query them in order
  4. Record elapsed time (current time - start time)

I'm unsure of what effects this will have with caching. How could I perform a warm-up period? Loading the file will likely clear any data that was in L1 or L2 caches for the last chunk. Also, what effect does maintaining a 1-5 million element string array have (and does even iterating it skew results)?

Keep in mind that the access pattern is important! For example, there are some hash tables with move-to-front heuristics, which re-orders the internal structure of the table. It would be incorrect to run a single chunk multiple times, or run chunks out of order. This makes warming CPU caches and HotSpot a bit more difficult (I could also keep a secondary dummy cache that's used for warming but not timing).

What are good practices for microbenchmarks with a giant dataset?

Upvotes: 3

Views: 147

Answers (1)

DNA
DNA

Reputation: 42617

If I understand the problem correctly, how about loading the query log on one machine, possibly in chunks if you don't have enough memory, and streaming it to the machine running the benchmark, across a dedicated network (a crossover cable, probably), so you have minimal interference between the system under test, and the test code/data ...?

Whatever solution you use, you should try multiple runs so you can assess repeatability - if you don't get reasonable repeatability then you can at least detect that your solution is unsuitable!

Updated: re: batching and timing - in practice, you'll probably end up with some form of fine-grained batching, at least, to get the data over the network efficiently. If your data falls into natural large 'groups' or stages then I would time those individually to check for anomalies, but would rely most strongly on an overall timing. I don't see much benefit from timing small batches of thousands (given that you are running through many millions).

Even if you run everything on one machine with a lot of RAM, it might be worth loading the data in one JVM and the code under test on another, so that garbage collection on the cache JVM is not (directly) affected by the large heap required to hold the query log.

Upvotes: 1

Related Questions