Virtual Threads Performance Degradation when running a mixture of CPU-intensive and blocking tasks

Question

I am using Spring Boot 3.4.0 (JDK21) with embedded Tomcat.

I was testing the functionality of virtual threads, with one of the tests being:

@GetMapping("/test")
    public void mostIOLowCPU() throws InterruptedException {
        long startTime = System.currentTimeMillis();

        while (System.currentTimeMillis() - startTime < 100) {
            double x = Math.random();  // Random operation to keep CPU busy
        }
        Thread.sleep(900);
    }

I used JMeter to load-test this endpoint. With virtual threads (enabled via the spring.threads.virtual.enabled option) at ~500 QPS (5000 users, 10s ramp-up time in JMeter), the average latency was about 19 seconds per operation. I then tested a normal thread pool with the following properties:

server:
  tomcat:
    threads:
      max: 10000
    max-connections: 10000
    accept-count: 10000

I chose these limits so that the test does not bottleneck on the number of threads (which could falsely suggest that platform threads are much slower when the real cause is a thread limit). With this configuration, the average latency is far lower, at 1.027 seconds.

I did thread dumps for

Virtual Threads
newThreadPerTaskExecutor Executor (that produces roughly the same performance as the thread pool above). I used this so that it "mimics" the behaviour of virtual threads, where a thread is created per task.

I did thread dumps at 2-3 seconds intervals. I noticed that

For virtual threads, the same virtual thread isn't awakened after 4 seconds, even though it's only supposed to sleep for 900ms (0.9s):

However, it's been executing new tasks
New Virtual Threads executing the CPU-intensive task

When I use platform threads (i.e. a new platform thread per task), the threads finish execution almost simultaneously once the 900ms sleep is done. The maximum latency is around 1200ms.

Why is this the case? My guess is that it's related to how the JVM schedules virtual threads to unpark and park onto platform threads.

dan1st · Accepted Answer

Even though your task is mostly IO bound, it still has a CPU bound part.

Virtual threads cannot be woken up when all carrier threads are busy

When a virtual thread is performing work on the CPU, it cannot be unmounted as mentioned in JEP 444:

The scheduler does not currently implement time sharing for virtual threads. Time sharing is the forceful preemption of a thread that has consumed an allotted quantity of CPU time. While time sharing can be effective at reducing the latency of some tasks when there are a relatively small number of platform threads and CPU utilization is at 100%, it is not clear that time sharing would be as effective with a million virtual threads.

So, assume you have P platform threads acting as carriers (typically the number of cores) and use V>P virtual threads for executing your task. If at least P of your virtual threads are executing CPU work, none of the other virtual threads have any chance of doing anything.
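To get a feel for the value of P on a given machine, the following minimal sketch reads the jdk.virtualThreadScheduler.parallelism system property described in JEP 444 and falls back to the number of available processors. The class and method names are made up for illustration, and the sketch only reports the configured value; it does not query the scheduler itself.

```java
// Sketch: estimate how many carrier threads (P) the default virtual-thread scheduler uses.
// CarrierCount and carrierParallelism are hypothetical names, not part of any library.
final class CarrierCount {

    /** Returns the configured parallelism, or the number of available processors if unset. */
    static int carrierParallelism() {
        int cores = Runtime.getRuntime().availableProcessors();
        // Integer.getInteger reads a system property and uses the default when it is missing.
        return Integer.getInteger("jdk.virtualThreadScheduler.parallelism", cores);
    }

    public static void main(String[] args) {
        System.out.println("P (carrier threads) = " + carrierParallelism());
    }
}
```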

So, let's take a look at your code:

long startTime = System.currentTimeMillis();

while (System.currentTimeMillis() - startTime < 100) {
    double x = Math.random();  // Random operation to keep CPU busy
}
Thread.sleep(900);

Assuming that System.currentTimeMillis() doesn't block (I don't think it does), the first part is fully CPU bound. Let's say you have V=100 virtual threads using P=4 platform threads. Then, 4 of your virtual threads (which ones is not defined) are executing the CPU bound part while all other virtual threads don't even start.

//as soon as P (4) virtual threads get to that point, no other virtual thread can do anything
long startTime = System.currentTimeMillis();

while (System.currentTimeMillis() - startTime < 100) {
    double x = Math.random();  // Random operation to keep CPU busy
}

Only when a virtual thread reaches Thread.sleep(900); can another thread start with long startTime = System.currentTimeMillis();. Virtual threads cannot be woken up when all carrier threads are busy. So, the CPU bound part would take (at least) 100*V/P milliseconds in total, because only P virtual threads can do it at a time and the rest have to wait their turn.

Your "losing" virtual threads "starve" and don't get executed until the "winning" virtual threads finish the CPU bound part.
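The 100*V/P estimate can be worked through with the numbers used above (V=100, P=4, 100 ms of CPU work per task). This is only a back-of-the-envelope sketch: the class and method names are hypothetical, and it rounds up to whole "waves" of P tasks, which matches 100*V/P whenever V is divisible by P.

```java
// Sketch: lower bound for the time until every task has finished its CPU-bound part
// when at most P virtual threads can run at once and none of them is preempted.
final class CpuPhaseEstimate {

    static long lowerBoundMillis(long cpuMillisPerTask, int virtualThreads, int carriers) {
        // Tasks run in waves of at most `carriers`; a partial last wave still costs a full slot.
        long waves = (virtualThreads + carriers - 1) / carriers;
        return waves * cpuMillisPerTask;
    }

    public static void main(String[] args) {
        // V = 100 virtual threads, P = 4 carriers, 100 ms of CPU work each.
        System.out.println(lowerBoundMillis(100, 100, 4) + " ms"); // prints "2500 ms"
    }
}
```

So the last of the 100 virtual threads cannot even begin its 900 ms sleep until roughly 2.5 seconds after the first one started.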

For virtual threads, the same virtual thread isn't awakened after 4 seconds, even though it's only supposed to sleep for 900ms (0.9s):

When the virtual thread is sleeping, other virtual threads run the CPU bound part. When it should wake up, it cannot get mounted because the carrier threads are still busy doing the work for other virtual threads.

As long as you have at least as many virtual threads actively doing CPU work as available carrier threads, you are essentially DOSing virtual threads.
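The effect can be observed in isolation with a small experiment, assuming JDK 21 or later. One virtual thread sleeps for 100 ms while twice as many spinning virtual threads as there are cores keep the carriers busy; the sketch then reports how long the sleep really took. All names and the 20 ms head start are choices made for this illustration, and it is not a benchmark.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

// Sketch (requires JDK 21+): measure how late a sleeping virtual thread wakes up
// while other virtual threads occupy the carriers with non-blocking CPU work.
final class StarvationDemo {

    /** Busy-waits for the given wall-clock duration without ever blocking. */
    static void spin(long millis) {
        long end = System.nanoTime() + millis * 1_000_000L;
        while (System.nanoTime() < end) {
            Thread.onSpinWait();
        }
    }

    /** Returns the observed duration in ms of Thread.sleep(sleepMillis) on a virtual thread. */
    static long measureWakeUpMillis(long sleepMillis, int spinners, long spinMillis) {
        AtomicLong elapsed = new AtomicLong(-1);
        Thread sleeper = Thread.ofVirtual().start(() -> {
            long start = System.nanoTime();
            try {
                Thread.sleep(sleepMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            elapsed.set((System.nanoTime() - start) / 1_000_000L);
        });
        try {
            Thread.sleep(20); // best effort: let the sleeper start and park first
            List<Thread> busy = new ArrayList<>();
            for (int i = 0; i < spinners; i++) {
                busy.add(Thread.ofVirtual().start(() -> spin(spinMillis)));
            }
            sleeper.join();
            for (Thread t : busy) {
                t.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while measuring", e);
        }
        return elapsed.get();
    }

    public static void main(String[] args) {
        int cores = Runtime.getRuntime().availableProcessors();
        long took = measureWakeUpMillis(100, cores * 2, 500);
        System.out.println("Thread.sleep(100) returned after " + took + " ms");
    }
}
```

If the spinners really do occupy every carrier, the reported time should be well above 100 ms, since the sleeper has to wait for a carrier to become free before it can continue.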

The case for platform threads

When I use platform threads (i.e. a new platform thread per task), the threads finish execution almost simultaneously once the 900ms sleep is done.

With platform threads, you have time sharing, so the OS scheduler ensures that other platform threads actually get to work by preempting the threads that are currently running.

You are not measuring 0.1s of CPU time

Your CPU bound code isn't doing 0.1s of CPU time. Instead, it keeps the CPU busy (which you noted in your comment) until at least 0.1s of wall-clock time has passed. However, nothing says that this 0.1s is exclusive to that thread. Time sharing allows other platform threads to use the CPU while your code is executing.

For simplicity, let's assume you have one CPU core and 2 platform threads. It is possible that thread 0 starts executing long startTime = System.currentTimeMillis(); and then the scheduler immediately decides to switch to thread 1, which also executes long startTime = System.currentTimeMillis();. Then, both threads take turns using the CPU, but as soon as 0.1s has passed since the measurement, one thread detects it and moves on to the Thread.sleep(). The scheduler can then switch to the other thread, which also detects that 0.1s has passed since its measurement and moves on to the Thread.sleep() as well. So, with 2 threads, each thread does only about "half as much CPU work", but the CPU is still kept busy for at least 0.1s in the metaphorical eyes of both threads. That is why multiple platform threads can complete your mostIOLowCPU method within about 1 second.

With virtual threads, your code would be closer to "do 0.1s of Java CPU work" (because they don't have time sharing) but with platform threads it's just "do something for the sake of doing something until 0.1s passed".

If you really want to measure 0.1s of CPU time, you would need to come up with some actual CPU bound work that takes around that time, for example encrypting or hashing some random numbers in an expensive way. Make sure to actually use the result in some way (e.g. include it in the response), which protects against dead code elimination. Also make sure the method is warmed up before making any measurement (i.e. it should take around 0.1s after you have already executed it at least 10k times), and the computation shouldn't have reusable results.
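As one possible shape for such work, here is a minimal sketch that hashes a seed repeatedly with SHA-256, feeding each digest into the next round so that nothing can be cached, and returns the final digest so the caller can use it. The class name, method name and round count are assumptions; the number of rounds that corresponds to roughly 0.1s has to be calibrated on the target machine after warm-up.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Sketch: a fixed amount of CPU-bound work instead of "spin until 100 ms have passed".
final class CpuWork {

    /** Hashes the seed `rounds` times; every round depends on the previous digest. */
    static byte[] hashChain(byte[] seed, int rounds) {
        try {
            MessageDigest sha256 = MessageDigest.getInstance("SHA-256");
            byte[] current = seed;
            for (int i = 0; i < rounds; i++) {
                current = sha256.digest(current); // digest(byte[]) also resets the instance
            }
            return current;
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 must be supported by every Java platform implementation.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // A varying seed keeps results from being reusable between calls.
        byte[] seed = Long.toString(System.nanoTime()).getBytes(StandardCharsets.UTF_8);
        byte[] result = hashChain(seed, 200_000); // placeholder round count, tune per machine
        // Printing (or returning) the result keeps the work from being eliminated as dead code.
        System.out.println(new BigInteger(1, result).toString(16));
    }
}
```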

Let's experiment and change the code

In your testing code, you are executing CPU bound tasks for 100 milliseconds. Executing work for 100ms isn't always the same amount of work depending on what your system is doing somewhere else (e.g. if the platform threads are busy). So, let's try the following:

//WARNING: THIS IS NOT GOOD CODE FOR ANY BENCHMARKING
public int mostIOLowCPU() throws InterruptedException {
    int count=0;//I added a counter here
    
    long startTime = System.currentTimeMillis();
    
    while (System.currentTimeMillis() - startTime < 100) {
        double x = Math.random();  // Random operation to keep CPU busy
        count++;//count the number of iterations
    }
    Thread.sleep(900);

    return count;
}

We can now test it with both virtual and platform threads:

//WARNING: THIS IS NOT A PROPER BENCHMARK
//virtual threads
ExecutorService service = Executors.newVirtualThreadPerTaskExecutor();
for(int i=0; i<100;i++){
    service.submit(() -> {
        System.out.println(mostIOLowCPU());
        return null;
    });
}

When I tested that, I got values between 10 000 and 200 000, but it takes a long time to execute.

Now, let's do this with platform threads.

//platform threads
//WARNING: THIS IS NOT A PROPER BENCHMARK
ExecutorService service = Executors.newCachedThreadPool();
for(int i=0; i<100;i++){
    service.submit(() -> {
        System.out.println(mostIOLowCPU());
        return null;
    });
}

With platform threads, I got values between 1 500 and 200 000 with many executions in the range of 1 000 - 10 000 but it completes quickly.

Note that this is not a good benchmark at all (don't assume anything from my code here), I just want to show you what virtual threads are doing. For a proper benchmark, you should use tools like JMH. You should also make sure you are blackholing the result of Math.random(), and you probably don't want to simulate CPU work by waiting to "complete n milliseconds of work on the CPU".

JMH?

If you want to use JMH, you can try running a benchmark similar to the following:

import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.infra.Blackhole;

@Threads(100)
@Fork(value=2/*, jvmArgsAppend = "-Djmh.executor=VIRTUAL"*/)//uncomment for virtual threads
@Warmup(time = 3, iterations = 3)
@Measurement(time = 3, iterations = 5)
public class VirtualThreadsBenchmark {
    
    @Benchmark
    public void run(Blackhole blackhole) throws InterruptedException {
        Blackhole.consumeCPU(50_000_000);//consume 50_000_000 tokens of "CPU work", JMH tries to ensure these take approximately the same time - 50_000_000 tokens are around 100ms on my device with a single thread
        Thread.sleep(900);
    }
}

Here, I'm not really getting a significant difference between virtual and platform threads, but I just ran it on a laptop to see what happens, so my results are very noisy.

With 1000 threads and only 5_000_000 CPU tokens, I am getting slightly better results with virtual threads (which could make sense due to less context switching during the CPU work but it could as well just be noise).

What you can do when facing this problem with real code

If you have this issue in an application, just let platform threads do the CPU bound work and keep the virtual threads for the blocking part.

public static final ExecutorService cpuBoundExecutor = Executors.newFixedThreadPool(8);//just an example

Future<?> cpuWork = cpuBoundExecutor.submit(() -> { // perform CPU intensive operation using platform threads
    long startTime = System.currentTimeMillis();
    
    while (System.currentTimeMillis() - startTime < 100) {
        double x = Math.random();  // Random operation to keep CPU busy
    }
    return null;//you can return a result here if you want
});
/*var result = */ cpuWork.get();//wait for the result//TODO handle exceptions (this is just for demonstration)

//perform IO in virtual thread
Thread.sleep(900);
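Put together as a runnable program, the idea might look like the sketch below, assuming JDK 21 or later. Each "request" runs in its own virtual thread, hands its CPU bound part to a fixed pool of platform threads and then sleeps to simulate IO. The class name, the placeholder arithmetic in cpuPart and the pool size are assumptions made for this example.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch (requires JDK 21+): virtual threads for blocking work, platform threads for CPU work.
final class OffloadDemo {

    // Pool size is an assumption; daemon threads so the pool never keeps the JVM alive.
    static final ExecutorService CPU_POOL = Executors.newFixedThreadPool(
            Runtime.getRuntime().availableProcessors(),
            r -> {
                Thread t = new Thread(r, "cpu-worker");
                t.setDaemon(true);
                return t;
            });

    /** Placeholder arithmetic standing in for real CPU bound work. */
    static long cpuPart(int iterations) {
        long acc = 0;
        for (int i = 0; i < iterations; i++) {
            acc = acc * 31 + i;
        }
        return acc;
    }

    /** Handles one request: CPU part on the platform pool, then simulated IO. */
    static long handle(int iterations, long ioMillis) {
        Future<Long> cpu = CPU_POOL.submit(() -> cpuPart(iterations));
        try {
            long result = cpu.get();   // a virtual thread parks here and frees its carrier
            Thread.sleep(ioMillis);    // simulated blocking IO
            return result;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        }
    }

    public static void main(String[] args) {
        long start = System.nanoTime();
        try (ExecutorService requests = Executors.newVirtualThreadPerTaskExecutor()) {
            for (int i = 0; i < 100; i++) {
                requests.submit(() -> handle(5_000_000, 900));
            }
        } // close() waits until all submitted tasks have finished
        System.out.println("100 requests took " + (System.nanoTime() - start) / 1_000_000L + " ms");
        CPU_POOL.shutdown();
    }
}
```

With this split, the carriers are only ever occupied briefly, so a virtual thread whose sleep has ended can be mounted again without waiting behind CPU bound work.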