Reputation: 53
Please, read the newest EDIT of this question.
Issue: I need to write a correct benchmark to compare a different work using different Thread Pool realizations (also from external libraries) using different methods of execution to other work using other Thread Pool realizations and to a work without any threading.
For example I have 24 tasks to complete and 10000 random Strings in benchmark state:
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Fork(1)
@BenchmarkMode(Mode.AverageTime)
@Warmup(iterations = 3)
@Measurement(iterations = 3)
@State(Scope.Benchmark)
public class ThreadPoolSamples {
@Param({"24"})
int amountOfTasks;
private static final int tts = Runtime.getRuntime().availableProcessors() * 2;
private String[] strs = new String[10000];
@Setup
public void setup() {
for (int i = 0; i < strs.length; i++) {
strs[i] = String.valueOf(Math.random());
}
}
}
And two States as inner classes representing Work (string concat.) and ExecutorService setup and shutdown:
@State(Scope.Thread)
public static class Work {
public String doWork(String[] strs) {
StringBuilder conc = new StringBuilder();
for (String str : strs) {
conc.append(str);
}
return conc.toString();
}
}
@State(Scope.Benchmark)
public static class ExecutorServiceState {
ExecutorService service;
@Setup(Level.Iteration)
public void setupMethod() {
service = Executors.newFixedThreadPool(tts);
}
@TearDown(Level.Iteration)
public void downMethod() {
service.shutdownNow();
service = null;
}
}
More strict question is: How to write correct benchmark to measure average time of doWork(); first: without any threading, second: using .execute() method and third: using .submit() method getting results of futures later. Implementation that I tried to wrote:
@Benchmark
public void noThreading(Work w, Blackhole bh) {
for (int i = 0; i < amountOfTasks; i++) {
bh.consume(w.doWork(strs));
}
}
@Benchmark
public void executorService(ExecutorServiceState e, Work w, Blackhole bh) {
for (int i = 0; i < amountOfTasks; i++) {
e.service.execute(() -> bh.consume(w.doWork(strs)));
}
}
@Benchmark
public void noThreadingResult(Work w, Blackhole bh) {
String[] strss = new String[amountOfTasks];
for (int i = 0; i < amountOfTasks; i++) {
strss[i] = w.doWork(strs);
}
bh.consume(strss);
}
@Benchmark
public void executorServiceResult(ExecutorServiceState e, Work w, Blackhole bh) throws ExecutionException, InterruptedException {
Future[] strss = new Future[amountOfTasks];
for (int i = 0; i < amountOfTasks; i++) {
strss[i] = e.service.submit(() -> {return w.doWork(strs);});
}
for (Future future : strss) {
bh.consume(future.get());
}
}
After benchmarking this implementation on my PC (2 Cores, 4 threads) I got:
Benchmark (amountOfTasks) Mode Cnt Score Error Units
ThreadPoolSamples.executorService 24 avgt 3 255102,966 ± 4460279,056 ns/op
ThreadPoolSamples.executorServiceResult 24 avgt 3 19790020,180 ± 7676762,394 ns/op
ThreadPoolSamples.noThreading 24 avgt 3 18881360,497 ± 340778,773 ns/op
ThreadPoolSamples.noThreadingResult 24 avgt 3 19283976,445 ± 471788,642 ns/op
noThreading and executorService maybe correct (but i am still unsure) and noThreadingResult and executorServiceResult doesn't look correct at all.
EDIT:
I find out some new details, but i think the result is still incorrect: as answered user17280749 in this answer that the thread pool wasn't waiting for submitted tasks to complete, but there wasn't only one issue: javac also somehow optimises doWork() method in the Work class (prob the result of that operation was predictable by JVM), so for simplicity I used Thread.sleep() as "work" and also setted amountOfTasks new two params: "1" and "128" to demonstrate that on 1 task threading will be slower than noThreading, and 24 and 128 will be approx. four times faster than noThreading, also to the correctness of measurement I setted thread pools starting up and shutting down in benchmark:
package io.denery;
import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.infra.Blackhole;
import java.util.concurrent.*;
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Fork(1)
@BenchmarkMode(Mode.AverageTime)
@Warmup(iterations = 3)
@Measurement(iterations = 3)
@State(Scope.Benchmark)
public class ThreadPoolSamples {
@Param({"1", "24", "128"})
int amountOfTasks;
private static final int tts = Runtime.getRuntime().availableProcessors() * 2;
@State(Scope.Thread)
public static class Work {
public void doWork() {
try {
Thread.sleep(1);
} catch (InterruptedException e) {
e.printStackTrace();
}
}
}
@Benchmark
public void noThreading(Work w) {
for (int i = 0; i < amountOfTasks; i++) {
w.doWork();
}
}
@Benchmark
public void fixedThreadPool(Work w)
throws ExecutionException, InterruptedException {
ExecutorService service = Executors.newFixedThreadPool(tts);
Future[] futures = new Future[amountOfTasks];
for (int i = 0; i < amountOfTasks; i++) {
futures[i] = service.submit(w::doWork);
}
for (Future future : futures) {
future.get();
}
service.shutdown();
}
@Benchmark
public void cachedThreadPool(Work w)
throws ExecutionException, InterruptedException {
ExecutorService service = Executors.newCachedThreadPool();
Future[] futures = new Future[amountOfTasks];
for (int i = 0; i < amountOfTasks; i++) {
futures[i] = service.submit(() -> {
w.doWork();
});
}
for (Future future : futures) {
future.get();
}
service.shutdown();
}
}
And the result of this benchmark is:
Benchmark (amountOfTasks) Mode Cnt Score Error Units
ThreadPoolSamples.cachedThreadPool 1 avgt 3 1169075,866 ± 47607,783 ns/op
ThreadPoolSamples.cachedThreadPool 24 avgt 3 5208437,498 ± 4516260,543 ns/op
ThreadPoolSamples.cachedThreadPool 128 avgt 3 13112351,066 ± 1905089,389 ns/op
ThreadPoolSamples.fixedThreadPool 1 avgt 3 1166087,665 ± 61193,085 ns/op
ThreadPoolSamples.fixedThreadPool 24 avgt 3 4721503,799 ± 313206,519 ns/op
ThreadPoolSamples.fixedThreadPool 128 avgt 3 18337097,997 ± 5781847,191 ns/op
ThreadPoolSamples.noThreading 1 avgt 3 1066035,522 ± 83736,346 ns/op
ThreadPoolSamples.noThreading 24 avgt 3 25525744,055 ± 45422,015 ns/op
ThreadPoolSamples.noThreading 128 avgt 3 136126357,514 ± 200461,808 ns/op
We see that error doesn't really huge, and thread pools with task 1 are slower than noThreading, but if you compare 25525744,055 and 4721503,799 the speedup is: 5.406 and it is faster somehow than excpected ~4, and if you compare 136126357,514 and 18337097,997 the speedup is: 7.4, and this fake speedup is growing with amountOfTasks, and i think it is still incorrect. I think to look at this using PrintAssembly to find out is there are any JVM optimisations.
EDIT:
As mentioned user17294549 in this answer, I used Thread.sleep() as imitation of real work and it doesn't correct because:
for real work: only 2 tasks can run simultaneously on a 2-core system for Thread.sleep(): any number of tasks can run simultaneously on a 2-core system
I remembered about Blackhole.consumeCPU(long tokens) JMH method that "burns cycles" and imitating a work, there is JMH example and docs for it. So I changed work to:
@State(Scope.Thread)
public static class Work {
public void doWork() {
Blackhole.consumeCPU(4096);
}
}
And benchmarks for this change:
Benchmark (amountOfTasks) Mode Cnt Score Error Units
ThreadPoolSamples.cachedThreadPool 1 avgt 3 301187,897 ± 95819,153 ns/op
ThreadPoolSamples.cachedThreadPool 24 avgt 3 2421815,991 ± 545978,808 ns/op
ThreadPoolSamples.cachedThreadPool 128 avgt 3 6648647,025 ± 30442,510 ns/op
ThreadPoolSamples.cachedThreadPool 2048 avgt 3 60229404,756 ± 21537786,512 ns/op
ThreadPoolSamples.fixedThreadPool 1 avgt 3 293364,540 ± 10709,841 ns/op
ThreadPoolSamples.fixedThreadPool 24 avgt 3 1459852,773 ± 160912,520 ns/op
ThreadPoolSamples.fixedThreadPool 128 avgt 3 2846790,222 ± 78929,182 ns/op
ThreadPoolSamples.fixedThreadPool 2048 avgt 3 25102603,592 ± 1825740,124 ns/op
ThreadPoolSamples.noThreading 1 avgt 3 10071,049 ± 407,519 ns/op
ThreadPoolSamples.noThreading 24 avgt 3 241561,416 ± 15326,274 ns/op
ThreadPoolSamples.noThreading 128 avgt 3 1300241,347 ± 148051,168 ns/op
ThreadPoolSamples.noThreading 2048 avgt 3 20683253,408 ± 1433365,542 ns/op
We see that fixedThreadPool is somehow slower than the example without threading, and when amountOfTasks is bigger, then difference between fixedThreadPool and noThreading examples is smaller. What's happening in there? Same phenomenon I saw with String concatenation in the beginning of this question, but I didn't report it. (btw, thanks who read this novel and trying to answer this question you're really help me)
Upvotes: 1
Views: 2449
Reputation: 53
I solved this issue by myself with the help of other answerers. In the last edit (And in all other edits) the issue was in my gradle configuration, so I was running this benchmark in all of my system threads, I use this gradle plugin to run JMH and before making all of my benchmarks in my gradle buildscript I set threads = 4
value, so you saw these strange benchmark results because JMH tried to benchmark on all available threads thread pool doing work on all available threads. I removed this configuration and set @State(Scope.Thread)
and @Threads(1)
annotations in benchmark class, a bit edited runInThreadPool()
method to:
public static void runInThreadPool(int amountOfTasks, Blackhole bh, ExecutorService threadPool)
throws InterruptedException, ExecutionException {
Future<?>[] futures = new Future[amountOfTasks];
for (int i = 0; i < amountOfTasks; i++) {
futures[i] = threadPool.submit(PrioritySchedulerSamples::doWork, (ThreadFactory) runnable -> {
Thread thread = new Thread(runnable);
thread.setPriority(10);
return thread;
});
}
for (Future<?> future : futures) {
bh.consume(future.get());
}
threadPool.shutdownNow();
threadPool.awaitTermination(10, TimeUnit.SECONDS);
}
So each thread in this thread pool runs in maximal priority. And benchmark of all these changes:
Benchmark (amountOfTasks) Mode Cnt Score Error Units
PrioritySchedulerSamples.fixedThreadPool 2048 avgt 3 8021054,516 ± 2874987,327 ns/op
PrioritySchedulerSamples.noThreading 2048 avgt 3 17583295,617 ± 5499026,016 ns/op
These results seems to be correct. (Especially for my system.)
I also made a list of common problems in microbenchmarking thread pools and basically all of the concurrent java components:
@Threads(1)
and @State(Scope.Thread)
annotations to make your microbenchmark executing in one thread. (use, for example, htop
command to find out how many and which threads are consuming the most CPU percentage)Thread.sleep()
for imitating the real work, instead JMH provides Blackhole.consumeCPU(long tokens)
method which you can freely use as imitation of some work.Upvotes: 0
Reputation: 1
Here is what I got on my machine (maybe this might help you understand what's the problem):
This is the benchmark (I modified it a little bit):
package io.denery;
import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.infra.Blackhole;
import org.openjdk.jmh.Main;
import java.util.concurrent.*;
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Fork(1)
@Threads(1)
@BenchmarkMode(Mode.AverageTime)
@Warmup(iterations = 5)
@Measurement(iterations = 5)
@State(Scope.Benchmark)
public class ThreadPoolSamples {
@Param({"1", "24", "128"})
int amountOfTasks;
private static final int tts = Runtime.getRuntime().availableProcessors() * 2;
private static void doWork() {
Blackhole.consumeCPU(4096);
}
public static void main(String[] args) throws Exception {
Main.main(args);
}
@Benchmark
public void noThreading() {
for (int i = 0; i < amountOfTasks; i++) {
doWork();
}
}
@Benchmark
public void fixedThreadPool(Blackhole bh) throws Exception {
runInThreadPool(amountOfTasks, bh, Executors.newFixedThreadPool(tts));
}
@Benchmark
public void cachedThreadPool(Blackhole bh) throws Exception {
runInThreadPool(amountOfTasks, bh, Executors.newCachedThreadPool());
}
private static void runInThreadPool(int amountOfTasks, Blackhole bh, ExecutorService threadPool)
throws Exception {
Future<?>[] futures = new Future[amountOfTasks];
for (int i = 0; i < amountOfTasks; i++) {
futures[i] = threadPool.submit(ThreadPoolSamples::doWork);
}
for (Future<?> future : futures) {
bh.consume(future.get());
}
threadPool.shutdownNow();
threadPool.awaitTermination(5, TimeUnit.MINUTES);
}
}
Specs and versions:
JMH version: 1.33
VM version: JDK 17.0.1, OpenJDK 64-Bit Server
Linux 5.14.14
CPU: Intel(R) Core(TM) i5-2320 CPU @ 3.00GHz, 4 Cores, No Hyper-Threading
Results:
Benchmark (amountOfTasks) Mode Cnt Score Error Units
ThreadPoolSamples.cachedThreadPool 1 avgt 5 92968.252 ± 2853.687 ns/op
ThreadPoolSamples.cachedThreadPool 24 avgt 5 547558.977 ± 88937.441 ns/op
ThreadPoolSamples.cachedThreadPool 128 avgt 5 1502909.128 ± 40698.141 ns/op
ThreadPoolSamples.fixedThreadPool 1 avgt 5 97945.026 ± 435.458 ns/op
ThreadPoolSamples.fixedThreadPool 24 avgt 5 643453.028 ± 135859.966 ns/op
ThreadPoolSamples.fixedThreadPool 128 avgt 5 998425.118 ± 126463.792 ns/op
ThreadPoolSamples.noThreading 1 avgt 5 10165.462 ± 78.008 ns/op
ThreadPoolSamples.noThreading 24 avgt 5 245942.867 ± 10594.808 ns/op
ThreadPoolSamples.noThreading 128 avgt 5 1302173.090 ± 5482.655 ns/op
Upvotes: 0
Reputation: 1
Look at this code:
@TearDown(Level.Iteration)
public void downMethod() {
service.shutdownNow();
service = null;
}
You don't wait for the threads to stop. Read the docs for details.
So some of your benchmarks might run in parallel with another 128 threads spawned by cachedThreadPool
in a previous benchmark.
so for simplicity I used Thread.sleep() as "work"
Are you sure?
There is a big difference between real work and Thread.sleep()
:
Thread.sleep()
: any number of tasks can run simultaneously on a 2-core systemUpvotes: 0
Reputation: 1
See answers to this question to learn how to write benchmarks in java.
... executorService maybe correct (but i am still unsure) ...
Benchmark (amountOfTasks) Mode Cnt Score Error Units ThreadPoolSamples.executorService 24 avgt 3 255102,966 ± 4460279,056 ns/op
It doesn't look correct like a correct result:
the error 4460279,056
is 17 times greater than the base value 255102,966
.
Also you have an error in:
@Benchmark
public void executorService(ExecutorServiceState e, Work w, Blackhole bh) {
for (int i = 0; i < amountOfTasks; i++) {
e.service.execute(() -> bh.consume(w.doWork(strs)));
}
}
You submit the tasks to the ExecutorService
, but doesn't wait for them to complete.
Upvotes: 0