Reputation: 33
I'm trying to run a calculation in two threads on a c4.large instance (a machine with two cores) on AWS, using Java 1.8 and Ubuntu. After adding the second thread, the calculation slows down from 26 seconds to 34 seconds per thread. I checked core usage, and after adding the second thread the second core is at 100%. On my local computer with a two-core processor, two threads don't slow each other down.
c4.large instance:
Thread 0 start
Thread 0 time: 26 seconds
Thread 1 start
Thread 0 time: 29 seconds
Thread 1 time: 34 seconds
Thread 0 time: 34 seconds
Thread 1 time: 34 seconds
Thread 0 time: 34 seconds
How can I improve the code below, or change the system configuration, to improve performance?
import java.io.IOException;
import java.util.Random;
import java.util.concurrent.ThreadLocalRandom;
import java.util.function.DoubleUnaryOperator;
import java.util.stream.DoubleStream;

public class TestCalculate {

    private Random rnd = ThreadLocalRandom.current();

    // a stream of `points` doubles uniformly distributed in [a, b)
    private DoubleStream randomPoints(long points, double a, double b) {
        return rnd.doubles(points)
                .limit(points)
                .map(d -> a + d * (b - a));
    }

    public static void main(String[] args) throws SecurityException, IOException {
        DoubleUnaryOperator du = x -> (x * Math.sqrt(23.35 * x * x) / Math.sqrt(34.54653324234324 * x) / Math.sqrt(213.3123)) * Math.sqrt(1992.34513213124 / x) / 88392.3 * x + 3.234324;
        for (int i = 0; i < 2; i++) {
            int j = i;
            new Thread(() -> {
                TestCalculate test = new TestCalculate();
                int x = 0;
                System.out.println("Thread " + j + " start");
                long start = System.currentTimeMillis();
                while (x++ < 4) {
                    double d = test.randomPoints(500_000_000L, 2, 10).map(du).sum();
                    long end = (System.currentTimeMillis() - start) / 1000;
                    System.out.println("Thread " + j + " time: " + end + " seconds, result: " + d);
                    start = System.currentTimeMillis();
                }
            }).start();
            // stagger the threads: the second one starts 40 s after the first
            try {
                Thread.sleep(40_000);
            } catch (InterruptedException e) {
                e.printStackTrace();
            }
        }
    }
}
Upvotes: 3
Views: 1134
Reputation: 64925
On the Amazon instance types page you find this note:
Each vCPU is a hyperthread of an Intel Xeon core except for T2.
Since your c4.large instance has 2 vCPUs, what you are really getting is both hyperthreads of a single CPU core, not two independent cores. Given that, it's entirely expected that running two threads doesn't double the throughput, since both threads are competing for resources on the same core. You saw a ~53% increase in throughput when adding the second thread, which actually means this code is quite hyperthread-friendly, since the average speedup from the second hyperthread is usually considered to be in the 30% range.
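The ~53% figure comes from comparing iterations completed per unit of time: alone, one thread finishes an iteration every 26 s; together, the two threads finish two iterations every 34 s. A quick check of that arithmetic:

```java
public class SpeedupCheck {
    public static void main(String[] args) {
        double singleThroughput = 1.0 / 26.0; // iterations per second, one thread
        double dualThroughput   = 2.0 / 34.0; // iterations per second, two threads
        double speedup = dualThroughput / singleThroughput; // 52/34 ≈ 1.53
        System.out.printf("Throughput ratio: %.2f%n", speedup);
    }
}
```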
You can reproduce this result locally, although on my Skylake CPU the hyperthreading penalty is apparently much lower. When I run a slightly modified version⁰ of TestCalculate restricted to two different physical cores on my 4-core, 8-hyperthread machine, as follows:
taskset -c 0,1 java stackoverflow.TestCalculate
I get the following results:
Thread 0 start
Thread 0: time: 2.21 seconds, result: 161774948.858291
Thread 0: time: 2.18 seconds, result: 161774943.838121
Thread 0: time: 2.18 seconds, result: 161774946.789039
Thread 1 start
Thread 1: time: 2.18 seconds, result: 161774945.535877
Thread 0: time: 2.18 seconds, result: 161774947.073892
Thread 1: time: 2.18 seconds, result: 161774937.356786
Thread 0: time: 2.18 seconds, result: 161774940.460682
Thread 1: time: 2.18 seconds, result: 161774944.699141
Thread 0: time: 2.18 seconds, result: 161774941.643486
Thread 0 stop
Thread 1: time: 2.18 seconds, result: 161774943.018521
Thread 1: time: 2.18 seconds, result: 161774941.866168
Thread 1: time: 2.18 seconds, result: 161774944.035612
Thread 1 stop
That is, there is approximately "perfect" scaling when adding a second thread, when each thread can run on a different core: the per-thread performance is the same to two decimal places.
On the other hand, when I run the process restricted to two hyperthreads of the same physical core¹, like:
taskset -c 0,4 java stackoverflow.TestCalculate
I get the following results:
Thread 0 start
Thread 0: time: 2.22 seconds, result: 161774949.278913
Thread 0: time: 2.19 seconds, result: 161774932.329415
Thread 0: time: 2.18 seconds, result: 161774943.604470
Thread 1 start
Thread 0: time: 2.31 seconds, result: 161774951.630203
Thread 1: time: 2.31 seconds, result: 161774951.695466
Thread 0: time: 2.31 seconds, result: 161774939.631680
Thread 1: time: 2.31 seconds, result: 161774943.523282
Thread 0: time: 2.32 seconds, result: 161774948.153244
Thread 0 stop
Thread 1: time: 2.32 seconds, result: 161774956.985513
Thread 1: time: 2.18 seconds, result: 161774950.335522
Thread 1: time: 2.18 seconds, result: 161774941.739148
Thread 1: time: 2.18 seconds, result: 161774946.275329
Thread 1 stop
So there was a 6% slowdown when running on the same core. That means this code is very hyperthread-friendly: a 6% slowdown means you got a 94% benefit from adding the second hyperthread! Skylake had several micro-architectural improvements that specifically helped hyperthreading scenarios, which perhaps explains the difference between your c4.large results (Haswell architecture) and mine. You might try EC2 C5 instances, since they use the Skylake architecture: if the slowdown is much smaller, that would confirm this theory.
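As a quick sanity check on any instance, you can ask the JVM how many logical CPUs it sees; note that this counts hyperthreads, not physical cores:

```java
public class CpuCount {
    public static void main(String[] args) {
        // On a c4.large this reports 2: the two hyperthreads of one
        // physical core, not two independent cores.
        System.out.println("Logical CPUs visible to the JVM: "
                + Runtime.getRuntime().availableProcessors());
    }
}
```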
⁰ Modified to make each iteration 10x shorter and to start the second thread deterministically after 3 iterations with a single thread.
¹ On my box, logical CPUs 0 and 4, 1 and 5, etc., belong to the same physical core.
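For reference, here is a sketch of what such a modified benchmark might look like. The 50-million-point count (10x fewer points, which also matches the ~1.6177e8 results printed above), the class name, and the latch-based deterministic start of the second thread are my reconstruction of footnote 0, not the answerer's exact code:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ThreadLocalRandom;
import java.util.function.DoubleUnaryOperator;

public class TestCalculateShort {
    // 10x fewer points than the original, so each iteration takes ~1/10 the time
    static final long POINTS = 50_000_000L;

    // one benchmark iteration: sum du over POINTS uniform samples in [2, 10)
    static double iteration(DoubleUnaryOperator du) {
        return ThreadLocalRandom.current()
                .doubles(POINTS)
                .map(d -> 2 + d * (10 - 2))
                .map(du)
                .sum();
    }

    static void run(int id, DoubleUnaryOperator du, int iterations, CountDownLatch latch) {
        System.out.println("Thread " + id + " start");
        for (int i = 0; i < iterations; i++) {
            long start = System.currentTimeMillis();
            double d = iteration(du);
            double secs = (System.currentTimeMillis() - start) / 1000.0;
            System.out.printf("Thread %d: time: %.2f seconds, result: %f%n", id, secs, d);
            // after 3 solo iterations, let the second thread begin
            if (latch != null && i == 2) {
                latch.countDown();
            }
        }
        System.out.println("Thread " + id + " stop");
    }

    public static void main(String[] args) throws InterruptedException {
        DoubleUnaryOperator du = x -> (x * Math.sqrt(23.35 * x * x)
                / Math.sqrt(34.54653324234324 * x) / Math.sqrt(213.3123))
                * Math.sqrt(1992.34513213124 / x) / 88392.3 * x + 3.234324;

        // released by thread 0 after 3 iterations, so thread 1 starts at a
        // deterministic point rather than after a fixed sleep
        CountDownLatch warmupDone = new CountDownLatch(1);

        Thread t0 = new Thread(() -> run(0, du, 6, warmupDone));
        Thread t1 = new Thread(() -> {
            try {
                warmupDone.await();
            } catch (InterruptedException e) {
                return;
            }
            run(1, du, 6, null);
        });
        t0.start();
        t1.start();
        t0.join();
        t1.join();
    }
}
```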
Upvotes: 2