Reputation: 1177
Sorry for the longish post; here's the TL;DR: on every Google Compute Engine AMD-powered VM instance, one vCPU is crippled in some way compared to the rest. Any idea how/why?
I did a performance/value analysis of the various instance types that Google Compute Engine provides and found that, for our workloads, the AMD EPYC Milan-powered `n2d` types offered the best performance and value. I then extended the comparison to other cloud providers (you can see a detailed cloud provider performance/value comparison here, covering Perl workloads as well as compiling and Geekbench for good measure). In the course of this, while trying to calculate things like scalability, I noticed something odd happening only with Google's AMD EPYC VMs: if you create a 2-vCPU, 4-vCPU or 8-vCPU (I did not try larger) AMD Rome (`n2d`) or AMD Milan (`n2d`, `t2d`, `c2d`) instance, one of the vCPUs is not the same as the others, at times performing significantly worse (depending on the workload, even over 50% worse). An exception is the 2-vCPU `t2d` or Rome `n2d`, where sometimes BOTH vCPUs can be the "slow" type.
The issue shows up as significant performance variance when running single-threaded benchmarks: the vCPUs look identical to the scheduler, so it is largely a matter of luck which one ends up handling the load. It becomes very clear, however, if you use `taskset` to set the processor affinity of the process. Taking Geekbench as an example on a `c2d` where CPU 0 is the "slow" one, we run:
taskset 1 ./geekbench5
And get a single-core result of 986 (the multi-core run uses 2 threads on that single vCPU, so the result is similar). Then try running on the other vCPU:
taskset 2 ./geekbench5
And get what the EPYC Milan can actually do, which is 1266 in this case.
If you look at the benchmark breakdown, several sub-benchmarks seem unaffected, falling within 5% of each other between the two runs, but there are some big differences, with `AES-XTS` being 3x faster on the second core! Here is a table with the fast vs slow vCPU Geekbench results on the various AMD Milan (M) and Rome (R) instances:
| vCPU | n2d-s2 (M) | n2d-s2 (R) | c2d-s2 (M) | t2d-s2 (M) |
|---|---|---|---|---|
| fast | 1251 | 970 | 1266 | 1150 |
| slow | 979 | 788 | 986 | 893 |
If you have an 8-vCPU instance, only 1 of those cores will have the performance issue, so it will affect you less. There is actually no pattern as to which vCPU is the problematic one; e.g. on an 8-vCPU instance you would try the parameters 1, 2, 4, 8, 16, 32, 64, 128 for `taskset` (it takes a bitmask) and any one of them can turn out to be the slow one, as in the loop sketched below.
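To find the slow vCPU without trying bitmasks by hand, here is a minimal shell sketch; it uses `openssl speed` with an XTS cipher as a quick single-threaded probe (the cipher choice and the 1-second duration are my own assumptions, picked because AES-XTS showed the largest gap in Geekbench - any single-threaded benchmark pinned with `taskset` would do):

```bash
#!/bin/bash
# Run a short pinned AES-XTS throughput test on every vCPU in turn;
# the "slow" vCPU shows up with a clearly lower numbers row.
for ((cpu = 0; cpu < $(nproc); cpu++)); do
    echo "=== vCPU $cpu ==="
    # taskset -c pins by CPU index (equivalent to the bitmask form, e.g. 'taskset 4' == 'taskset -c 2')
    taskset -c "$cpu" openssl speed -seconds 1 -evp aes-256-xts 2>/dev/null | tail -n 1
done
```

On an affected instance the slow vCPU stands out immediately, roughly in line with the 3x AES-XTS difference shown above.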
I tried the LMbench microbenchmarks to see if there was any difference in the memory latency timings, in case the slow core doesn't get access to all the caches etc., but all the LMbench memory benchmarks gave similar results for the fast vs slow cores, apart from `libc bcopy` and `Memory bzero bandwidth`, which reported over twice the bandwidth for the unaffected CPU at the first 512 B - 1 KB block sizes, with the slow CPU slowly catching up after the 4 KB mark:
libc bcopy unaligned

| block size (MB) | fast vCPU (MB/s) | slow vCPU (MB/s) |
|---|---|---|
| 0.000512 | 74850.98 | 39376.69 |
| 0.001024 | 102429.05 | 56302.91 |
| 0.002048 | 104352.51 | 74090.38 |
| 0.004096 | 108161.33 | 90174.68 |
| 0.008192 | 97034.51 | 85216.90 |
| 0.016384 | 99009.57 | 93743.92 |
| 0.032768 | 54218.61 | 52910.72 |
| 0.065536 | 53300.89 | 49660.89 |
| 0.131072 | 50072.18 | 51533.84 |

libc bcopy aligned

| block size (MB) | fast vCPU (MB/s) | slow vCPU (MB/s) |
|---|---|---|
| 0.000512 | 82067.77 | 38346.13 |
| 0.001024 | 103010.95 | 55810.31 |
| 0.002048 | 104568.18 | 72664.92 |
| 0.004096 | 105635.03 | 85124.44 |
| 0.008192 | 91593.23 | 85398.67 |
| 0.016384 | 93007.97 | 91137.35 |
| 0.032768 | 51232.94 | 49939.64 |
| 0.065536 | 49703.80 | 49675.92 |
| 0.131072 | 49760.35 | 49396.37 |

Memory bzero bandwidth

| block size (MB) | fast vCPU (MB/s) | slow vCPU (MB/s) |
|---|---|---|
| 0.000512 | 83182.36 | 43423.32 |
| 0.001024 | 95353.76 | 61157.60 |
| 0.002048 | 103437.22 | 76770.77 |
| 0.004096 | 70911.40 | 61986.23 |
| 0.008192 | 84881.63 | 77339.78 |
| 0.016384 | 95343.37 | 87949.77 |
| 0.032768 | 97565.34 | 91436.64 |
| 0.065536 | 93136.11 | 89826.23 |
| 0.131072 | 95790.48 | 90689.07 |
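If you want to reproduce just these affected numbers, here is a minimal sketch using the standalone `bw_mem` tool from lmbench; the block sizes, the CPU numbers, and the mapping of the report sections above to the `bcopy`/`bzero` operations are my assumptions, so adjust them to whichever vCPU turned out slow on your instance:

```bash
#!/bin/bash
# Compare small-block memory bandwidth on a pinned fast vs slow vCPU.
# bw_mem prints "<size in MB> <MB/s>" on stderr for each run.
for op in bcopy bzero; do
    echo "=== bw_mem $op ==="
    for size in 512 1k 2k 4k 8k 16k; do
        for cpu in 0 1; do           # assumption: vCPU 0 is the slow one, vCPU 1 a normal one
            printf 'vCPU %d: ' "$cpu"
            taskset -c "$cpu" bw_mem "$size" "$op" 2>&1
        done
    done
done
```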
All the other benchmarks, including unrolled bcopy, memory reads/writes, latency etc., were within the margin of error between the vCPUs. I am not sure what this tells us; in general I am at a loss as to how this is happening (some sort of Google hypervisor bug?) and why nobody seems to have noticed something that was quite obvious from the start of my benchmarking - I can find no references when googling for it. And surely you test performance when you design a cloud solution, so this shouldn't have made it past QA in the first place.
I can't see what I might be doing wrong on my side. Apart from Debian Bullseye, I tried other distros and saw the same with Debian Buster, Ubuntu and CentOS. A few more details on the things I tried are in the last part of my aforementioned cloud performance comparison blog post. If anyone has any insight into this, I'd be curious to know what's going on.
Upvotes: 5
Views: 1327
Reputation: 1177
Just to let you know, this is Google's official answer:
> I believe the behaviour that has been captured here is the underlying Compute Engine resource that Google Cloud uses within its hypervisor to run essential networking and management tasks. As a result of this component, CPU may not be able to reach 100% on all cores. Smaller instances may see this more so due to the relative size of the component to the size of the machine.
>
> The overall stance of this symptom at this time is that this is part of the infrastructure and working as intended. You're welcome to choose other CPU architecture types, or use a larger machine type; either of these should make up for the overhead which is taken up by the hypervisor resource. I've still forwarded your findings, reproduction steps and comments to the Compute Engine team for their record.
It sounds a bit reasonable, until you start wondering why it only happens with EPYC instances, why nothing similar happens with other providers' EPYC instances, why it is random whether 1 or 2 vCPUs of your instance are affected, and why it only affects small writes.
So, for me, it can't exactly be "working as intended", but I don't think it's worth pursuing further with them; as I said, it's not a big issue unless of course you happen to run something very specific, e.g. AES encryption on 4 KB blocks, which will run at just 1/3 the speed on the compromised vCPU.
Update 11/2022:
The problem seems to have gone away on new `n2d` and `t2d` instances for me. It is still a problem with `c2d` instances, which were a waste of money IMHO anyway (not at all faster than the cheaper types), so they are to be avoided.
Update 03/2023:
Not only are the `c2d` instances now fixed as well, but I see they finally got increased clock speeds / performance vs the `n2d`, so there is now a reason to pay more for them.
Upvotes: 4