Speed ratio of algorithm versus precompiled reference implementation differs across computers

Question

We have a small C++ project with the following architecture.

These two were compiled into a DLL:

An algorithm
A tester for the algorithm which checks the correctness of the result and measures the execution speed.

Someone else then writes another implementation of the same algorithm.

The main() function does this:

Invoke the tester on both implementations of the algorithm and measure their execution speed. This is done several times, so that averages can be taken later.
Compute the speed ratio between them (measured time/measured reference time). This is referred to as the score.
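
For concreteness, the measurement described above could look roughly like the sketch below. The names reference_sum, candidate_sum, average_seconds and score are hypothetical stand-ins, not the actual project code, and the summing "algorithm" is only a placeholder.

```cpp
#include <chrono>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical stand-ins for the two implementations under test.
long long reference_sum(const std::vector<int>& v) {
    return std::accumulate(v.begin(), v.end(), 0LL);
}

long long candidate_sum(const std::vector<int>& v) {
    long long total = 0;
    for (std::size_t i = 0; i < v.size(); ++i) total += v[i];
    return total;
}

// Average wall-clock seconds per call over `runs` calls (runs must be > 0).
template <typename F>
double average_seconds(F&& f, const std::vector<int>& input, int runs) {
    using clock = std::chrono::steady_clock;
    volatile long long sink = 0;  // keeps the optimizer from discarding the calls
    const auto start = clock::now();
    for (int i = 0; i < runs; ++i) sink = sink + f(input);
    const auto stop = clock::now();
    return std::chrono::duration<double>(stop - start).count() / runs;
}

// Score as defined in the question: measured time / measured reference time.
double score(double measured_seconds, double reference_seconds) {
    return measured_seconds / reference_seconds;
}
```

Calling score(average_seconds(candidate_sum, data, runs), average_seconds(reference_sum, data, runs)) would then give the ratio for one machine. Since both times are taken on the same machine, the score is a relative figure, and nothing guarantees that two different machines slow down or speed up both implementations by the same factor.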

We found that running the very same code and DLL on different computers returned quite different speed ratios. On one computer an implementation scored 6.4, and the very same implementation scored 2.8 on another machine. How could that be?

Cornstalks · Accepted Answer

There could be tons of factors, but here are a few:

CPU cache can be a big one. Different processors have different caches (and not just in terms of raw cache size, but also caching strategies). One might be "smarter" than the other, or perhaps one just happens to work better than another in this specific situation.
CPU pipelining. Instructions these days are interleaved in the CPU, even in a single thread of execution. The way the CPU pipeline works varies from CPU to CPU, and one CPU might be able to do two particular things at once, while another CPU can't. If one of the implementations exploits this, then it gets a speed boost (or if they both do, then they both get closer to the same speed).
CPU instruction execution times may vary. Two CPUs executing the exact same instructions can take a different amount of time for each one. If one computer's CPU is slow at a particular instruction (and one of the implementations happens to use that instruction), while another CPU has been improved to execute that instruction faster, then the time discrepancy between the implementations will be larger on the first computer.
Branch prediction models in the CPUs might be different, and one implementation might be more or less friendly to a particular CPU's branch prediction model.
Operating systems can affect this in many ways, from memory allocation strategies (maybe one OS has a memory allocation strategy that causes a bigger discrepancy in times, while another OS has a different allocation strategy that minimizes the discrepancy), to CPU time slice management (are the algorithms multithreaded, for example?).
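
As an illustration of the cache point, here is a minimal sketch (not taken from the project in the question) of two functions that compute the same result with different memory access patterns. The function names and the row-major layout are assumptions made for the example.

```cpp
#include <cstddef>
#include <vector>

// Sums an n x n matrix stored row-major, visiting elements in memory order.
long long sum_row_order(const std::vector<int>& m, std::size_t n) {
    long long total = 0;
    for (std::size_t r = 0; r < n; ++r)
        for (std::size_t c = 0; c < n; ++c)
            total += m[r * n + c];
    return total;
}

// Same result, but consecutive accesses are n elements apart in memory,
// so far fewer of them hit a cache line that is already loaded.
long long sum_column_order(const std::vector<int>& m, std::size_t n) {
    long long total = 0;
    for (std::size_t c = 0; c < n; ++c)
        for (std::size_t r = 0; r < n; ++r)
            total += m[r * n + c];
    return total;
}
```

If these two were timed against each other on a large matrix, the ratio would be expected to depend on cache sizes, cache line length and prefetching behaviour, all of which differ between processors. An optimizing compiler may also rearrange such loops, so any actual measurement should be checked against the generated code.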
