javamultithreadingbenchmarkingmicrobenchmarkjmh

Reputation: 53

JMH - How to correctly benchmark Thread Pools?

Please, read the newest EDIT of this question.

Issue: I need to write a correct benchmark to compare a different work using different Thread Pool realizations (also from external libraries) using different methods of execution to other work using other Thread Pool realizations and to a work without any threading.

For example I have 24 tasks to complete and 10000 random Strings in benchmark state:

@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Fork(1)
@BenchmarkMode(Mode.AverageTime)
@Warmup(iterations = 3)
@Measurement(iterations = 3)
@State(Scope.Benchmark)
public class ThreadPoolSamples {
    @Param({"24"})
    int amountOfTasks;
    private static final int tts = Runtime.getRuntime().availableProcessors() * 2;
    private String[] strs = new String[10000];

    @Setup
    public void setup() {
        for (int i = 0; i < strs.length; i++) {
            strs[i] = String.valueOf(Math.random());
        }
    }
}

And two States as inner classes representing Work (string concat.) and ExecutorService setup and shutdown:

@State(Scope.Thread)
public static class Work {
    public String doWork(String[] strs) {
        StringBuilder conc = new StringBuilder();
        for (String str : strs) {
            conc.append(str);
        }
        return conc.toString();
    }
}

@State(Scope.Benchmark)
public static class ExecutorServiceState {
    ExecutorService service;

    @Setup(Level.Iteration)
    public void setupMethod() {
        service = Executors.newFixedThreadPool(tts);
    }

    @TearDown(Level.Iteration)
    public void downMethod() {
        service.shutdownNow();
        service = null;
    }
}

More strict question is: How to write correct benchmark to measure average time of doWork(); first: without any threading, second: using .execute() method and third: using .submit() method getting results of futures later. Implementation that I tried to wrote:

@Benchmark
public void noThreading(Work w, Blackhole bh) {
    for (int i = 0; i < amountOfTasks; i++) {
        bh.consume(w.doWork(strs));
    }
}

@Benchmark
public void executorService(ExecutorServiceState e, Work w, Blackhole bh) {
    for (int i = 0; i < amountOfTasks; i++) {
         e.service.execute(() -> bh.consume(w.doWork(strs)));
    }
}

@Benchmark
public void noThreadingResult(Work w, Blackhole bh) {
    String[] strss = new String[amountOfTasks];
    for (int i = 0; i < amountOfTasks; i++) {
        strss[i] = w.doWork(strs);
    }
    bh.consume(strss);
}

@Benchmark
public void executorServiceResult(ExecutorServiceState e, Work w, Blackhole bh) throws ExecutionException, InterruptedException {
    Future[] strss = new Future[amountOfTasks];
    for (int i = 0; i < amountOfTasks; i++) {
        strss[i] = e.service.submit(() -> {return w.doWork(strs);});
    }
    for (Future future : strss) {
        bh.consume(future.get());
    }
}

After benchmarking this implementation on my PC (2 Cores, 4 threads) I got:

Benchmark                              (amountOfTasks)  Mode  Cnt         Score         Error  Units
ThreadPoolSamples.executorService                     24  avgt    3    255102,966 ± 4460279,056  ns/op
ThreadPoolSamples.executorServiceResult               24  avgt    3  19790020,180 ± 7676762,394  ns/op
ThreadPoolSamples.noThreading                         24  avgt    3  18881360,497 ±  340778,773  ns/op
ThreadPoolSamples.noThreadingResult                   24  avgt    3  19283976,445 ±  471788,642  ns/op

noThreading and executorService maybe correct (but i am still unsure) and noThreadingResult and executorServiceResult doesn't look correct at all.

EDIT:

I find out some new details, but i think the result is still incorrect: as answered user17280749 in this answer that the thread pool wasn't waiting for submitted tasks to complete, but there wasn't only one issue: javac also somehow optimises doWork() method in the Work class (prob the result of that operation was predictable by JVM), so for simplicity I used Thread.sleep() as "work" and also setted amountOfTasks new two params: "1" and "128" to demonstrate that on 1 task threading will be slower than noThreading, and 24 and 128 will be approx. four times faster than noThreading, also to the correctness of measurement I setted thread pools starting up and shutting down in benchmark:

package io.denery;

import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.infra.Blackhole;

import java.util.concurrent.*;

@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Fork(1)
@BenchmarkMode(Mode.AverageTime)
@Warmup(iterations = 3)
@Measurement(iterations = 3)
@State(Scope.Benchmark)
public class ThreadPoolSamples {
    @Param({"1", "24", "128"})
    int amountOfTasks;
    private static final int tts = Runtime.getRuntime().availableProcessors() * 2;

    @State(Scope.Thread)
    public static class Work {
        public void doWork() {
            try {
                Thread.sleep(1);
            } catch (InterruptedException e) {
                e.printStackTrace();
            }
        }
    }

    @Benchmark
    public void noThreading(Work w) {
        for (int i = 0; i < amountOfTasks; i++) {
            w.doWork();
        }
    }

    @Benchmark
    public void fixedThreadPool(Work w)
            throws ExecutionException, InterruptedException {
        ExecutorService service = Executors.newFixedThreadPool(tts);
        Future[] futures = new Future[amountOfTasks];
        for (int i = 0; i < amountOfTasks; i++) {
            futures[i] = service.submit(w::doWork);
        }
        for (Future future : futures) {
            future.get();
        }

        service.shutdown();
    }

    @Benchmark
    public void cachedThreadPool(Work w)
            throws ExecutionException, InterruptedException {
        ExecutorService service = Executors.newCachedThreadPool();
        Future[] futures = new Future[amountOfTasks];
        for (int i = 0; i < amountOfTasks; i++) {
            futures[i] = service.submit(() -> {
                w.doWork();
            });
        }
        for (Future future : futures) {
            future.get();
        }

        service.shutdown();
    }
}

And the result of this benchmark is:

Benchmark                         (amountOfTasks)  Mode  Cnt          Score         Error  Units
ThreadPoolSamples.cachedThreadPool                1  avgt    3    1169075,866 ±   47607,783  ns/op
ThreadPoolSamples.cachedThreadPool               24  avgt    3    5208437,498 ± 4516260,543  ns/op
ThreadPoolSamples.cachedThreadPool              128  avgt    3   13112351,066 ± 1905089,389  ns/op
ThreadPoolSamples.fixedThreadPool                 1  avgt    3    1166087,665 ±   61193,085  ns/op
ThreadPoolSamples.fixedThreadPool                24  avgt    3    4721503,799 ±  313206,519  ns/op
ThreadPoolSamples.fixedThreadPool               128  avgt    3   18337097,997 ± 5781847,191  ns/op
ThreadPoolSamples.noThreading                     1  avgt    3    1066035,522 ±   83736,346  ns/op
ThreadPoolSamples.noThreading                    24  avgt    3   25525744,055 ±   45422,015  ns/op
ThreadPoolSamples.noThreading                   128  avgt    3  136126357,514 ±  200461,808  ns/op

We see that error doesn't really huge, and thread pools with task 1 are slower than noThreading, but if you compare 25525744,055 and 4721503,799 the speedup is: 5.406 and it is faster somehow than excpected ~4, and if you compare 136126357,514 and 18337097,997 the speedup is: 7.4, and this fake speedup is growing with amountOfTasks, and i think it is still incorrect. I think to look at this using PrintAssembly to find out is there are any JVM optimisations.

EDIT:

As mentioned user17294549 in this answer, I used Thread.sleep() as imitation of real work and it doesn't correct because:

for real work: only 2 tasks can run simultaneously on a 2-core system
for Thread.sleep(): any number of tasks can run simultaneously on a 2-core system

I remembered about Blackhole.consumeCPU(long tokens) JMH method that "burns cycles" and imitating a work, there is JMH example and docs for it. So I changed work to:

@State(Scope.Thread)
public static class Work {
    public void doWork() {
        Blackhole.consumeCPU(4096);
    }
}

And benchmarks for this change:

Benchmark                         (amountOfTasks)  Mode  Cnt         Score          Error  Units
ThreadPoolSamples.cachedThreadPool                1  avgt    3    301187,897 ±    95819,153  ns/op
ThreadPoolSamples.cachedThreadPool               24  avgt    3   2421815,991 ±   545978,808  ns/op
ThreadPoolSamples.cachedThreadPool              128  avgt    3   6648647,025 ±    30442,510  ns/op
ThreadPoolSamples.cachedThreadPool             2048  avgt    3  60229404,756 ± 21537786,512  ns/op
ThreadPoolSamples.fixedThreadPool                 1  avgt    3    293364,540 ±    10709,841  ns/op
ThreadPoolSamples.fixedThreadPool                24  avgt    3   1459852,773 ±   160912,520  ns/op
ThreadPoolSamples.fixedThreadPool               128  avgt    3   2846790,222 ±    78929,182  ns/op
ThreadPoolSamples.fixedThreadPool              2048  avgt    3  25102603,592 ±  1825740,124  ns/op
ThreadPoolSamples.noThreading                     1  avgt    3     10071,049 ±      407,519  ns/op
ThreadPoolSamples.noThreading                    24  avgt    3    241561,416 ±    15326,274  ns/op
ThreadPoolSamples.noThreading                   128  avgt    3   1300241,347 ±   148051,168  ns/op
ThreadPoolSamples.noThreading                  2048  avgt    3  20683253,408 ±  1433365,542  ns/op

We see that fixedThreadPool is somehow slower than the example without threading, and when amountOfTasks is bigger, then difference between fixedThreadPool and noThreading examples is smaller. What's happening in there? Same phenomenon I saw with String concatenation in the beginning of this question, but I didn't report it. (btw, thanks who read this novel and trying to answer this question you're really help me)

Upvotes: 1

Answers (4)

ExtraExtremeERA

Reputation: 53

I solved this issue by myself with the help of other answerers. In the last edit (And in all other edits) the issue was in my gradle configuration, so I was running this benchmark in all of my system threads, I use this gradle plugin to run JMH and before making all of my benchmarks in my gradle buildscript I set threads = 4 value, so you saw these strange benchmark results because JMH tried to benchmark on all available threads thread pool doing work on all available threads. I removed this configuration and set @State(Scope.Thread) and @Threads(1) annotations in benchmark class, a bit edited runInThreadPool() method to:

public static void runInThreadPool(int amountOfTasks, Blackhole bh, ExecutorService threadPool)
            throws InterruptedException, ExecutionException {
        Future<?>[] futures = new Future[amountOfTasks];
        for (int i = 0; i < amountOfTasks; i++) {
            futures[i] = threadPool.submit(PrioritySchedulerSamples::doWork, (ThreadFactory) runnable -> {
                Thread thread = new Thread(runnable);
                thread.setPriority(10);
                return thread;
            });
        }
        for (Future<?> future : futures) {
            bh.consume(future.get());
        }

        threadPool.shutdownNow();
        threadPool.awaitTermination(10, TimeUnit.SECONDS);
    }

So each thread in this thread pool runs in maximal priority. And benchmark of all these changes:

Benchmark                                 (amountOfTasks)  Mode  Cnt         Score         Error  Units
PrioritySchedulerSamples.fixedThreadPool             2048  avgt    3   8021054,516 ± 2874987,327  ns/op
PrioritySchedulerSamples.noThreading                 2048  avgt    3  17583295,617 ± 5499026,016  ns/op

These results seems to be correct. (Especially for my system.)

I also made a list of common problems in microbenchmarking thread pools and basically all of the concurrent java components:

Make sure your microbenchmark is executing in one thread, use @Threads(1) and @State(Scope.Thread) annotations to make your microbenchmark executing in one thread. (use, for example, htop command to find out how many and which threads are consuming the most CPU percentage)
Make sure you execute the task completely in your microbenchmark, and wait for all Threads to complete this task. (maybe your microbenchmark doesn't wait for tasks to complete?)
Don't use Thread.sleep() for imitating the real work, instead JMH provides Blackhole.consumeCPU(long tokens) method which you can freely use as imitation of some work.
Make sure you know the component that you benchmark. (obvious, but I didn't know before this post java thread pools very well)
Make sure you know compiler optimization effects described in these JMH samles basically know JMH very well.

Upvotes: 0

user17298733

Reputation: 1

Here is what I got on my machine (maybe this might help you understand what's the problem):

This is the benchmark (I modified it a little bit):

package io.denery;

import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.infra.Blackhole;
import org.openjdk.jmh.Main;
import java.util.concurrent.*;

@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Fork(1)
@Threads(1)
@BenchmarkMode(Mode.AverageTime)
@Warmup(iterations = 5)
@Measurement(iterations = 5)
@State(Scope.Benchmark)
public class ThreadPoolSamples {
  @Param({"1", "24", "128"})
  int amountOfTasks;
  private static final int tts = Runtime.getRuntime().availableProcessors() * 2;

  private static void doWork() {
    Blackhole.consumeCPU(4096);
  }

  public static void main(String[] args) throws Exception {
    Main.main(args);
  }

  @Benchmark
  public void noThreading() {
    for (int i = 0; i < amountOfTasks; i++) {
      doWork();
    }
  }

  @Benchmark
  public void fixedThreadPool(Blackhole bh) throws Exception {
    runInThreadPool(amountOfTasks, bh, Executors.newFixedThreadPool(tts));
  }

  @Benchmark
  public void cachedThreadPool(Blackhole bh) throws Exception {
    runInThreadPool(amountOfTasks, bh, Executors.newCachedThreadPool());
  }

  private static void runInThreadPool(int amountOfTasks, Blackhole bh, ExecutorService threadPool)
      throws Exception {
    Future<?>[] futures = new Future[amountOfTasks];
    for (int i = 0; i < amountOfTasks; i++) {
      futures[i] = threadPool.submit(ThreadPoolSamples::doWork);
    }
    for (Future<?> future : futures) {
      bh.consume(future.get());
    }

    threadPool.shutdownNow();
    threadPool.awaitTermination(5, TimeUnit.MINUTES);
  }
}

Specs and versions:

JMH version: 1.33  
VM version: JDK 17.0.1, OpenJDK 64-Bit Server
Linux 5.14.14
CPU: Intel(R) Core(TM) i5-2320 CPU @ 3.00GHz, 4 Cores, No Hyper-Threading

Results:

Benchmark                           (amountOfTasks)  Mode  Cnt        Score        Error  Units
ThreadPoolSamples.cachedThreadPool                1  avgt    5    92968.252 ±   2853.687  ns/op
ThreadPoolSamples.cachedThreadPool               24  avgt    5   547558.977 ±  88937.441  ns/op
ThreadPoolSamples.cachedThreadPool              128  avgt    5  1502909.128 ±  40698.141  ns/op
ThreadPoolSamples.fixedThreadPool                 1  avgt    5    97945.026 ±    435.458  ns/op
ThreadPoolSamples.fixedThreadPool                24  avgt    5   643453.028 ± 135859.966  ns/op
ThreadPoolSamples.fixedThreadPool               128  avgt    5   998425.118 ± 126463.792  ns/op
ThreadPoolSamples.noThreading                     1  avgt    5    10165.462 ±     78.008  ns/op
ThreadPoolSamples.noThreading                    24  avgt    5   245942.867 ±  10594.808  ns/op
ThreadPoolSamples.noThreading                   128  avgt    5  1302173.090 ±   5482.655  ns/op

Upvotes: 0

user17294549

Reputation: 1

Look at this code:

    @TearDown(Level.Iteration)
    public void downMethod() {
        service.shutdownNow();
        service = null;
    }

You don't wait for the threads to stop. Read the docs for details.
So some of your benchmarks might run in parallel with another 128 threads spawned by cachedThreadPool in a previous benchmark.

so for simplicity I used Thread.sleep() as "work"

Are you sure?
There is a big difference between real work and Thread.sleep():

for real work: only 2 tasks can run simultaneously on a 2-core system
for Thread.sleep(): any number of tasks can run simultaneously on a 2-core system

Upvotes: 0

user17280749

Reputation: 1

See answers to this question to learn how to write benchmarks in java.

... executorService maybe correct (but i am still unsure) ...

Benchmark                              (amountOfTasks)  Mode  Cnt         Score         Error  Units
ThreadPoolSamples.executorService                     24  avgt    3    255102,966 ± 4460279,056  ns/op

It doesn't look correct like a correct result: the error 4460279,056 is 17 times greater than the base value 255102,966.

Also you have an error in:

@Benchmark
public void executorService(ExecutorServiceState e, Work w, Blackhole bh) {
    for (int i = 0; i < amountOfTasks; i++) {
         e.service.execute(() -> bh.consume(w.doWork(strs)));
    }
}

You submit the tasks to the ExecutorService, but doesn't wait for them to complete.

Upvotes: 0

JMH - How to correctly benchmark Thread Pools?

Answers (4)

Related Questions