Reputation: 55
My Java multi-threading code is as follows. I expected that if I use n threads (n < available cores), the total execution time would be 1/n of that with a single thread. But the experimental results don't show this: 4128 ms for 1 thread, 2200 ms for 2 threads, 3114 ms for 3 threads, 3031 ms for 4 threads.
public static void main(String[] args) {
    List<Document> doc = createDocuments(50000);
    int numThreads = 4;
    List<List> partitionedDocs = partitionData(doc, numThreads);
    CountDownLatch latch = new CountDownLatch(numThreads);
    List<Thread> threads = new ArrayList<>();
    List<Runnable> executors = new ArrayList<>();
    for (List input : partitionedDocs) {
        Runnable executor = new SplitDocuments(input, " ", latch);
        executors.add(executor);
        threads.add(new Thread(executor));
    }
    for (Thread t : threads) {
        t.start();
    }
    try {
        latch.await();
    } catch (InterruptedException e) {
        e.printStackTrace();
    }
}
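The exact timing code isn't included above; a minimal sketch of how a total time like the numbers mentioned could be obtained (not necessarily the original measurement code) is:
// Sketch only: wall-clock time of the whole run, from starting the threads
// to latch.await() returning. Uses the threads/latch variables from main above.
long totalStart = System.currentTimeMillis();
for (Thread t : threads) {
    t.start();
}
try {
    latch.await();
} catch (InterruptedException e) {
    e.printStackTrace();
}
System.out.println("total: " + (System.currentTimeMillis() - totalStart) + " ms");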
For the methods used in main:
public static List<Document> createDocuments(Integer size) {
    List<Document> x = new ArrayList<>();
    String text = "";
    for (int i = 0; i < 500; i++) {
        text += "abc ";
    }
    for (int i = 0; i < size; i++) {
        Document doc = new Document(i, text);
        x.add(doc);
    }
    return x;
}
private static List<List> partitionData(List input, int numThreads) {
    List<List> results = new ArrayList<>();
    int subListSize = input.size() / numThreads;
    for (int i = 0; i < numThreads; i++) {
        if (i == numThreads - 1) {
            results.add(input.subList(i * subListSize, input.size()));
        } else {
            results.add(input.subList(i * subListSize, (i + 1) * subListSize));
        }
    }
    return results;
}
For the SplitDocuments class:
public class SplitDocuments implements Runnable {
    private String splitter;
    private List<Document> docs = new ArrayList<>();
    private CountDownLatch latch;
    private List<Document> result = new ArrayList<>();

    public SplitDocuments(List<Document> docs, String splitter, CountDownLatch latch) {
        this.docs = docs;
        this.splitter = splitter;
        this.latch = latch;
    }

    @Override
    public void run() {
        System.out.println("number of docs: " + this.docs.size());
        long start = System.currentTimeMillis();
        for (Document t : this.docs) {
            Document temp = new Document(t);
            temp.setTokens(Arrays.asList(t.text.split(this.splitter)));
            this.result.add(temp);
        }
        this.latch.countDown();
        System.out.println(String.format("split documents costs: %d ms", System.currentTimeMillis() - start));
    }

    public List<Document> getResult() {
        return result;
    }
}
Upvotes: 2
Views: 620
Reputation: 1587
I'm not sure I can say anything specific about your timings, since it's still not clear how exactly you measured/calculated the numbers, but you are right: "multithreading does not give a linear speedup to the number of threads...1/n" in most cases. Why?
I'm sure you have heard about Amdahl's law (https://en.wikipedia.org/wiki/Amdahl%27s_law); there is a small numeric sketch of it right after the list below. You can achieve "1/n" only if the program has no sequential part, i.e. no portion that cannot be parallelized. But even if your own code/business logic/algorithm (or the portion of it to be parallelized) has no such part, the underlying software and hardware layers that execute your code can introduce sequential parts because of contention on shared resources. For example:
- you can pass only one Ethernet packet at a time over a single-mode fiber or over a single twisted pair;
- your HDD can write data at only one position of its magnetic heads at a time;
- access to main memory is sequential because of the front-side bus and the memory bus;
- L2/L3 cache invalidation takes additional work to keep the caches synchronized;
- a translation lookaside buffer (TLB) miss leads to a walk to main memory (see above about the front-side bus and the memory bus);
- allocating a new object on the JVM heap requires some internal synchronization when a thread-local allocation buffer (TLAB) is full;
- you can get stop-the-world pauses during GC compaction, which look like a sequential portion of your code;
- classloading and the JIT's activity add their own sequential-like parts to the application;
- a 3rd-party library may synchronize on a static value;
- etc., etc.
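Coming back to Amdahl's law itself, here is a tiny numeric sketch of what it predicts (the 90% parallel fraction below is just an assumed value for illustration, not a measured one):
// Amdahl's law: speedup(n) = 1 / ((1 - p) + p / n), where p is the parallelizable fraction.
public class AmdahlDemo {
    public static void main(String[] args) {
        double p = 0.9; // assumption: 90% of the work can be parallelized
        for (int n = 1; n <= 4; n++) {
            double speedup = 1.0 / ((1.0 - p) + p / n);
            System.out.printf("threads=%d -> max theoretical speedup=%.2f%n", n, speedup);
        }
    }
}
Even with only 10% sequential work, 4 threads can give you at most ~3.08x, not 4x.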
So, you can see "1/n" only in very simple cases, when neither your application code nor the underlying layers use any shared resource. Here is an example:
import java.util.concurrent.atomic.AtomicLong;

public class Test {
    public static int NUM_OF_THREADS = 4;
    public static int NUM_OF_ITERATIONS = 100_000_000;

    static class Job extends Thread {
        private final AtomicLong counter = new AtomicLong();

        @Override
        public void run() {
            for (int i = 0; i < NUM_OF_ITERATIONS; i++) {
                counter.incrementAndGet();
            }
        }
    }

    public static void main(String[] args) throws Exception {
        final Job[] jobs = new Job[NUM_OF_THREADS];
        for (int i = 0; i < jobs.length; i++) {
            jobs[i] = new Job();
        }
        final long startNanos = System.nanoTime();
        for (Job job : jobs) {
            job.start();
        }
        for (Job job : jobs) {
            job.join();
        }
        final long spentNanos = System.nanoTime() - startNanos;
        System.out.printf("Time spent (s): %f%n", (spentNanos / 1_000_000_000d));
    }
}
Each thread increments its own instance of the counter. For NUM_OF_THREADS = {1, 2, 3, 4} I get the same execution time, ~0.6 sec. This means we have "1/n": we increase the number of calculations, but we don't spend more time. We don't allocate new memory, we don't experience GC pauses, and we don't share any data among the threads (each thread's single counter/volatile long sits in the L1 cache of its owner core).
Now, let's share a resource among the threads: all of them increment one single instance of the counter. This should lead to cache invalidation and intensive inter-core communication:
public class Test {
    ...
    private static final AtomicLong counter = new AtomicLong();
    ...

    static class Job extends Thread {
        @Override
        public void run() {
            for (int i = 0; i < NUM_OF_ITERATIONS; i++) {
                counter.incrementAndGet();
            }
        }
    }
    ...
After such a small modification, we see a dramatic performance degradation: 4 threads spend ~8 sec to do all their iterations. 8 sec vs 0.6 sec(!)
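As an aside, not part of the measurements above: if you really need a heavily contended shared counter, java.util.concurrent.atomic.LongAdder applies the same "give each thread its own data" idea internally and usually degrades much less. A minimal sketch:
import java.util.concurrent.atomic.LongAdder;

public class LongAdderTest {
    // Shared counter, but LongAdder keeps per-thread cells internally,
    // so contended increments don't fight over a single cache line.
    static final LongAdder counter = new LongAdder();

    static class Job extends Thread {
        @Override
        public void run() {
            for (int i = 0; i < 100_000_000; i++) {
                counter.increment();
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Job[] jobs = new Job[4];
        for (int i = 0; i < jobs.length; i++) jobs[i] = new Job();
        for (Job job : jobs) job.start();
        for (Job job : jobs) job.join();
        System.out.println("sum: " + counter.sum()); // exact once all threads have finished
    }
}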
This is an illustration that it isn't enough to understand theoretical things like big-O; it's very important to understand how things work under the hood, how exactly the computer executes our code. It's important to have a performance model, even a very simplified one.
UPDATE: After you have added your measurement to your code... Your measurement looks correct, as does the way you partition the input data. This means you really do have a hidden sequential part in the execution. After playing with your code, I think you have a memory-management bottleneck: you allocate too much memory in your code, and this affects the execution. Here is what I did exactly...
I took your code and added a dummy Document class. Then I tried to configure the JVM so that all the data fits into the Eden/young generation, with the options: -Xmx8G -Xms8G -XX:NewRatio=10 -XX:+PrintGCDetails
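My dummy Document class isn't important for the results; a stub along these lines, inferred from the calls in your snippets (the exact field names are my guess), is enough to make the code compile:
import java.util.List;

// Inferred from the question's code: new Document(i, text), new Document(t),
// t.text, setTokens(...). The real class may differ.
public class Document {
    public final int id;
    public final String text;
    private List<String> tokens;

    public Document(int id, String text) {
        this.id = id;
        this.text = text;
    }

    public Document(Document other) { // copy constructor used in SplitDocuments
        this(other.id, other.text);
    }

    public void setTokens(List<String> tokens) {
        this.tokens = tokens;
    }
}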
A run with numThreads=1 gave ~1 sec, a run with numThreads=4 gave ~0.2x-0.3x sec. The ratio was ~1/4-1/3, as expected. The GC details looked like this:
Heap
PSYoungGen total 2446848K, used 1803997K [0x0000000715580000, 0x00000007c0000000, 0x00000007c0000000)
eden space 2097664K, 86% used [0x0000000715580000,0x0000000783737448,0x0000000795600000)
from space 349184K, 0% used [0x00000007aab00000,0x00000007aab00000,0x00000007c0000000)
to space 349184K, 0% used [0x0000000795600000,0x0000000795600000,0x00000007aab00000)
ParOldGen total 5592576K, used 0K [0x00000005c0000000, 0x0000000715580000, 0x0000000715580000)
object space 5592576K, 0% used [0x00000005c0000000,0x00000005c0000000,0x0000000715580000)
Metaspace used 2930K, capacity 4494K, committed 4864K, reserved 1056768K
class space used 299K, capacity 386K, committed 512K, reserved 1048576K
You can see that Eden never filled up and no Full GC happened.
Then I changed the number of docs from 50_000 to 100_000... A run with numThreads=1 gave ~2 sec, a run with numThreads=4 gave ~1 sec. The ratio was ~1/2. What happened?... The GC details contained the following new message:
[GC (Allocation Failure) [PSYoungGen: 2097664K->349175K(2446848K)] 2097664K->1627175K(8039424K), 0,5229844 secs] [Times: user=4,78 sys=0,45, real=0,53 secs]
This means that JVM couldn't allocate a new object in Eden and had to trigger the Garbage Collection process. We spent "real=0,53 secs" inside GC.
If you allocate more memory in your code or have a smaller heap, you get even worse results (I got a ratio of ~1/1.4 for the case of 1 thread vs 4 threads) because of the Full GC that gets triggered:
[Full GC (Ergonomics) [PSYoungGen: 349172K->0K(2446848K)] [ParOldGen: 4675608K->4925474K(5592576K)] 5024780K->4925474K(8039424K), [Metaspace: 2746K->2746K(1056768K)], 5,7606117 secs] [Times: user=56,58 sys=0,26, real=5,76 secs]
That's why we (in the FinTech area) prefer to write allocation/GC-free code in Java :)
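For illustration, here is one way to shave some allocations off your run(). It's only a sketch: it assumes that mutating the original Document in place is acceptable for you, and the String.split() results still dominate the allocations anyway.
@Override
public void run() {
    long start = System.currentTimeMillis();
    // Pre-size the list so it never needs to grow and copy its backing array.
    List<Document> local = new ArrayList<>(this.docs.size());
    for (Document t : this.docs) {
        // Assumption: it's OK to set the tokens on the original document,
        // which avoids one extra Document allocation per element.
        t.setTokens(Arrays.asList(t.text.split(this.splitter)));
        local.add(t);
    }
    this.result = local;
    this.latch.countDown();
    System.out.println(String.format("split documents costs: %d ms",
            System.currentTimeMillis() - start));
}
Since split() itself still produces most of the garbage, tuning the heap/young generation as shown above matters more here than these micro-changes.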
Upvotes: 5