Reputation: 1177
Sorry for the longish post; here's the TL;DR: on every Google Compute Engine AMD-powered VM instance, one vCPU is crippled in some way compared to the rest. Any idea how/why?
I did a performance/value analysis of the various instance types that Google Compute Engine provides and found that, for our workloads, the AMD EPYC Milan-powered `n2d` types offered the best performance and value. I then extended the comparison to other cloud providers (you can see a detailed cloud provider performance/value comparison here, covering Perl workloads as well as compiling and Geekbench for good measure). In the course of this, while trying to calculate things like scalability, I noticed something odd happening only with Google's AMD EPYC VMs: if you create a 2-vCPU, 4-vCPU or 8-vCPU (I did not try larger) AMD Rome (`n2d`) or AMD Milan (`n2d`, `t2d`, `c2d`) instance, one of the vCPUs is not the same as the others, at times performing significantly worse (depending on the workload, even over 50% worse). An exception is the 2-vCPU `t2d` or Rome `n2d`, where sometimes BOTH vCPUs can be the "slow" type.
The issue shows up as significant performance variance when running single-threaded benchmarks: the vCPUs look identical to the scheduler, so it is largely a matter of luck which one ends up handling the load. It becomes very clear, however, if you use `taskset` to set the processor affinity of the process. Taking Geekbench as an example on a `c2d` where CPU 0 is the "slow" one, we run:
taskset 1 ./geekbench5
And get a single-core result of 986 (the multi-core run uses 2 threads on that single vCPU, so the result is similar). Then try running on the other vCPU:
taskset 2 ./geekbench5
And get what the EPYC Milan can actually do, which is 1266 in this case.
If you look at the benchmark breakdown, several sub-benchmarks seem unaffected, falling within 5% of each other between the two runs, but there are some big differences, with `AES-XTS` being 3x faster on the second core! Here is a table with the fast vs slow vCPU Geekbench results on the various AMD Milan (M) and Rome (R) instances:
| vCPU | n2d-s2 (M) | n2d-s2 (R) | c2d-s2 (M) | t2d-s2 (M) |
|---|---|---|---|---|
| fast | 1251 | 970 | 1266 | 1150 |
| slow | 979 | 788 | 986 | 893 |
If you have an 8-vCPU instance, only 1 of those cores will have the performance issue, so it will affect you less. There is actually no pattern as to which vCPU is the problematic one; e.g. on an 8-vCPU instance you would try the parameters 1, 2, 4, 8, 16, 32, 64, 128 for `taskset` (it takes a bitmask) and any one of them can turn out to be the slow one, as in the loop sketched below.
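To find the slow vCPU without trying bitmasks by hand, here is a minimal shell sketch; it uses `openssl speed` with an XTS cipher as a quick single-threaded probe (the cipher choice and the 1-second duration are my own assumptions, picked because AES-XTS showed the largest gap in Geekbench - any single-threaded benchmark pinned with `taskset` would do):

```bash
#!/bin/bash
# Run a short pinned AES-XTS throughput test on every vCPU in turn;
# the "slow" vCPU shows up with a clearly lower numbers row.
for ((cpu = 0; cpu < $(nproc); cpu++)); do
    echo "=== vCPU $cpu ==="
    # taskset -c pins by CPU index (equivalent to the bitmask form, e.g. 'taskset 4' == 'taskset -c 2')
    taskset -c "$cpu" openssl speed -seconds 1 -evp aes-256-xts 2>/dev/null | tail -n 1
done
```

On an affected instance the slow vCPU stands out immediately, roughly in line with the 3x AES-XTS difference shown above.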
I tried the LMbench microbenchmarks to see if there was any difference in the memory latency timings, in case the slow core doesn't get access to all the caches etc., but all the LMbench memory benchmarks gave similar results for the fast vs slow cores, apart from `libc bcopy` and `Memory bzero bandwidth`, which reported over twice the bandwidth for the unaffected CPU at the first 512 B - 1 KB block sizes, with the slow CPU slowly catching up after the 4 KB mark:
libc bcopy unaligned

| block size (MB) | fast vCPU (MB/s) | slow vCPU (MB/s) |
|---|---|---|
| 0.000512 | 74850.98 | 39376.69 |
| 0.001024 | 102429.05 | 56302.91 |
| 0.002048 | 104352.51 | 74090.38 |
| 0.004096 | 108161.33 | 90174.68 |
| 0.008192 | 97034.51 | 85216.90 |
| 0.016384 | 99009.57 | 93743.92 |
| 0.032768 | 54218.61 | 52910.72 |
| 0.065536 | 53300.89 | 49660.89 |
| 0.131072 | 50072.18 | 51533.84 |

libc bcopy aligned

| block size (MB) | fast vCPU (MB/s) | slow vCPU (MB/s) |
|---|---|---|
| 0.000512 | 82067.77 | 38346.13 |
| 0.001024 | 103010.95 | 55810.31 |
| 0.002048 | 104568.18 | 72664.92 |
| 0.004096 | 105635.03 | 85124.44 |
| 0.008192 | 91593.23 | 85398.67 |
| 0.016384 | 93007.97 | 91137.35 |
| 0.032768 | 51232.94 | 49939.64 |
| 0.065536 | 49703.80 | 49675.92 |
| 0.131072 | 49760.35 | 49396.37 |

Memory bzero bandwidth

| block size (MB) | fast vCPU (MB/s) | slow vCPU (MB/s) |
|---|---|---|
| 0.000512 | 83182.36 | 43423.32 |
| 0.001024 | 95353.76 | 61157.60 |
| 0.002048 | 103437.22 | 76770.77 |
| 0.004096 | 70911.40 | 61986.23 |
| 0.008192 | 84881.63 | 77339.78 |
| 0.016384 | 95343.37 | 87949.77 |
| 0.032768 | 97565.34 | 91436.64 |
| 0.065536 | 93136.11 | 89826.23 |
| 0.131072 | 95790.48 | 90689.07 |
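If you want to reproduce just these affected numbers, here is a minimal sketch using the standalone `bw_mem` tool from lmbench; the block sizes, the CPU numbers, and the mapping of the report sections above to the `bcopy`/`bzero` operations are my assumptions, so adjust them to whichever vCPU turned out slow on your instance:

```bash
#!/bin/bash
# Compare small-block memory bandwidth on a pinned fast vs slow vCPU.
# bw_mem prints "<size in MB> <MB/s>" on stderr for each run.
for op in bcopy bzero; do
    echo "=== bw_mem $op ==="
    for size in 512 1k 2k 4k 8k 16k; do
        for cpu in 0 1; do           # assumption: vCPU 0 is the slow one, vCPU 1 a normal one
            printf 'vCPU %d: ' "$cpu"
            taskset -c "$cpu" bw_mem "$size" "$op" 2>&1
        done
    done
done
```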
All the other benchmarks, including unrolled bcopy, memory reads/writes, latency etc., were within the margin of error between the vCPUs. I am not sure what this tells us; in general I am at a loss as to how this is happening (some sort of Google hypervisor bug?) and why nobody seems to have noticed something that was quite obvious from the start of my benchmarking - I can find no references when googling for it. And surely you test performance when you design a cloud solution, so this shouldn't have made it past QA in the first place.
I can't see what I might be doing wrong on my side. Apart from Debian Bullseye, I tried other distros and saw the same with Debian Buster, Ubuntu and CentOS. A few more details on the things I tried are in the last part of my aforementioned cloud performance comparison blog post. If anyone has any insight into this, I'd be curious to know what's going on.
Upvotes: 5
Views: 1327
Reputation: 1177
Just to let you know, this is Google's official answer:
> I believe the behaviour that has been captured here is the underlying Compute Engine resource that Google Cloud uses within its hypervisor to run essential networking and management tasks. As a result of this component, CPU may not be able to reach 100% on all cores. Smaller instances may see this more so due to the relative size of the component to the size of the machine.
>
> The overall stance of this symptom at this time is that this is part of the infrastructure and working as intended. You're welcome to choose other CPU architecture types, or use a larger machine type; either of these should make up for the overhead which is taken up by the hypervisor resource. I've still forwarded your findings, reproduction steps and comments to the Compute Engine team for their record.
It sounds a bit reasonable, until you start wondering why it only happens with EPYC instances, why nothing similar happens with other providers' EPYC instances, why it is random whether 1 or 2 vCPUs of your instance are affected, and why it only affects small writes.
So, for me, it can't exactly be "working as intended", but I don't think it's worth pursuing further with them; as I said, it's not a big issue unless of course you happen to run something very specific, e.g. AES encryption on 4 KB blocks, which will run at just 1/3 the speed on the compromised vCPU.
Update 11/2022:
The problem seems to have gone away on new `n2d` and `t2d` instances for me. It is still a problem with `c2d` instances, which were a waste of money IMHO anyway (not at all faster than the cheaper types), so they are to be avoided.
Update 03/2023:
Not only are the `c2d` instances now fixed as well, but I see they finally got increased clock speeds / performance vs the `n2d`, so there is now a reason to pay more for them.
Upvotes: 4