Bala

Reputation: 4358

Two operations in one loop vs two loops performing the same operations one per loop

This question is essentially the same as Two loop bodies or one (result identical), except that in my case I use Java.

I have two loops that each run a billion times.

int a = 188, b = 144, aMax = 0, bMax = 0;

for (int i = 0; i < 1000000000; i++) {
  int t = a ^ i;
  if (t > aMax) 
    aMax = t;     
}  

for (int i = 0; i < 1000000000; i++) {
  int t = b ^ i;
  if (t > bMax) 
    bMax = t;     
}  

Running these two loops on my machine takes approximately 4 seconds. When I fuse them into a single loop that performs all the operations, it runs in 2 seconds. As you can see, the loop bodies consist only of trivial, constant-time operations.
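The fused version I am referring to looks roughly like this (same variables as above; a sketch rather than my exact code):

for (int i = 0; i < 1000000000; i++) {
  int t = a ^ i;
  if (t > aMax)
    aMax = t;
  int u = b ^ i;
  if (u > bMax)
    bMax = u;
}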

My question is: where is this performance improvement coming from?

My guess is that the only extra cost of the two separate loops is that i gets incremented and checked against 1000000000 two billion times in total, versus only one billion times when the loops are fused. Is anything else going on in there?

Thanks!

Upvotes: 20

Views: 16790

Answers (4)

Peter Lawrey

Reputation: 533870

In short, the CPU can execute the instructions in the merged loop in parallel, doubling performance.

It's also possible that the second loop is not optimised efficiently. The first loop triggers compilation of the whole method, so the second loop gets compiled without any profiling metrics, which can distort its timing. I would place each loop in a separate method to make sure this is not the case.

The CPU can perform a large number of independent operations in parallel (a pipeline depth of about 10 on the Pentium III and 20 on the Xeon). One operation it attempts to do in parallel is a branch, using branch prediction, but this only pays off if the branch goes the same way almost every time; otherwise a misprediction stalls the pipeline.

I suspect that with loop unrolling your merged loop looks more like the following (possibly with even more unrolling than shown here):

for (int i = 0; i < 1000000000; i += 2) {
  // this first block is run almost in parallel
  int t1 = a ^ i;
  int t2 = b ^ i;
  int t3 = a ^ (i+1);
  int t4 = b ^ (i+1);
  // this block runs in parallel
  if (t1 > aMax) aMax = t1;     
  if (t2 > bMax) bMax = t2;     
  if (t3 > aMax) aMax = t3;     
  if (t4 > bMax) bMax = t4;     
} 

Upvotes: 2

ddimitrov

Reputation: 3343

Did you use -server? If not, you should - the client JIT is neither as predictable nor as good. If you are really interested in what exactly is going on, you can use UnlockDiagnosticVMOptions + LogCompilation to check what optimizations are being applied in both cases (all the way down to the generated assembly).
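For example, something along these lines on a HotSpot JVM (YourBenchmark stands in for whatever class you actually run):

java -server -XX:+UnlockDiagnosticVMOptions -XX:+LogCompilation YourBenchmark

This produces an XML log of the JIT compiler's decisions that you can compare for both versions of the code.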

Also, from the code you provided I can't tell whether you do a warm-up, whether you run the test once or multiple times in the same JVM, or whether you did several runs in different JVMs. Are you taking the best, the average, or the median time, and do you throw out outliers?
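Here is a rough sketch of the kind of harness I mean (class and method names are made up for illustration; adjust the warm-up and run counts to taste):

import java.util.Arrays;

public class LoopBench {

    // The loop under test lives in its own method so the JIT compiles the whole method
    // normally instead of relying on on-stack replacement.
    static int xorLoop(int a) {
        int max = 0;
        for (int i = 0; i < 1000000000; i++) {
            int t = a ^ i;
            if (t > max) max = t;
        }
        return max; // return the result so the computation cannot be optimised away
    }

    public static void main(String[] args) {
        int sink = 0;

        // Warm-up: give the JIT a chance to compile xorLoop before timing it.
        for (int i = 0; i < 5; i++) sink += xorLoop(188);

        // Several timed runs in the same JVM; report the median to reduce the impact of outliers.
        long[] millis = new long[5];
        for (int r = 0; r < millis.length; r++) {
            long start = System.nanoTime();
            sink += xorLoop(188);
            millis[r] = (System.nanoTime() - start) / 1000000;
        }
        Arrays.sort(millis);
        System.out.println("median ms: " + millis[millis.length / 2] + " (sink=" + sink + ")");
    }
}

Repeat the whole thing in a few separate JVM invocations if you also want to see run-to-run variance.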

Here is a good link on the subject of writing Java micro-benchmarks: http://www.ibm.com/developerworks/java/library/j-jtp02225/index.html

Edit: One more micro-benchmarking tip: beware of on-stack replacement (OSR): http://www.azulsystems.com/blog/cliff/2011-11-22-what-the-heck-is-osr-and-why-is-it-bad-or-good

Upvotes: 1

assylias

Reputation: 328883

If you don't run a warm-up phase, it is possible that the first loop gets optimised and compiled but not the second one, whereas when you merge them the whole merged loop gets compiled. Also, with the server option and your code as posted, most of the work gets optimised away because you don't use the results.

I have run the test below, putting each loop as well as the merged loop in its own method and warming up the JVM to make sure everything gets compiled.

Results (JVM options: -server -XX:+PrintCompilation):

  • loop 1 = 500 ms
  • loop 2 = 900 ms
  • merged loop = 1,300 ms

So the merged loop is slightly faster, but not that much.

public static void main(String[] args) throws InterruptedException {

    for (int i = 0; i < 3; i++) {
        loop1();
        loop2();
        loopBoth();
    }

    long start = System.nanoTime();

    loop1();

    long end = System.nanoTime();
    System.out.println((end - start) / 1000000);

    start = System.nanoTime();
    loop2();
    end = System.nanoTime();
    System.out.println((end - start) / 1000000);

    start = System.nanoTime();
    loopBoth();
    end = System.nanoTime();
    System.out.println((end - start) / 1000000);
}

public static void loop1() {
    int a = 188, aMax = 0;
    for (int i = 0; i < 1000000000; i++) {
        int t = a ^ i;
        if (t > aMax) {
            aMax = t;
        }
    }
    System.out.println(aMax);
}

public static void loop2() {
    int b = 144, bMax = 0;
    for (int i = 0; i < 1000000000; i++) {
        int t = b ^ i;
        if (t > bMax) {
            bMax = t;
        }
    }
    System.out.println(bMax);
}

public static void loopBoth() {
    int a = 188, b = 144, aMax = 0, bMax = 0;

    for (int i = 0; i < 1000000000; i++) {
        int t = a ^ i;
        if (t > aMax) {
            aMax = t;
        }
        int u = b ^ i;
        if (u > bMax) {
            bMax = u;
        }
    }
    System.out.println(aMax);
    System.out.println(bMax);
}

Upvotes: 6

Cratylus

Reputation: 54094

It seems to me that in the single-loop case the JIT may opt to do loop unrolling, and as a result the performance is slightly better.

Upvotes: 1
