Reputation: 33
Any tips for troubleshooting would be appreciated.
We are running a large server with multiple sites on a single m6a.24xlarge instance. We had a 15TB EBS volume with all of our websites on it. As we were quickly approaching the EBS volume size limit, we decided to switch to EFS, with the long-term goal of adding load balancing. Before implementing this, we put one of our largest customers on the EFS drive and there were no performance issues.
Slowly, I began to transfer sites over to the EFS volume, creating symlinks on the EBS volume that point to EFS. While transferring, I set the cold access (Infrequent Access, IA) transition to 1 day to reduce overall storage costs during the transition phase. Once the initial transfer was completed, I performed a delta transfer and switched each site over one at a time, with IA set back to 30 days.
Everything slowed down greatly once we reached the last 25% of the sites. I thought that maybe it was data being read back out of cold storage (Infrequent Access). Performance did improve as data moved out of IA, but we are still seeing issues 2 weeks later, and the issue below leads me to believe we are hitting a bottleneck I can't locate.
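For anyone checking the same thing: the IA transition is controlled by the file system's lifecycle configuration, which can be inspected and changed with the AWS CLI. A minimal sketch, with a placeholder file system ID, that keeps the 30-day IA transition but also moves files back to Standard storage on first access:

# Show the current lifecycle (IA) policy; the file system ID below is a placeholder
aws efs describe-lifecycle-configuration --file-system-id fs-0123456789abcdef0

# Transition to IA after 30 days, and back to Standard on first access,
# so frequently read site files don't keep getting served from IA
aws efs put-lifecycle-configuration \
    --file-system-id fs-0123456789abcdef0 \
    --lifecycle-policies "TransitionToIA=AFTER_30_DAYS" "TransitionToPrimaryStorageClass=AFTER_1_ACCESS"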
When I switched everything to the EFS mount, the server would not work at all with the plugins folder on EFS (all sites use this shared folder for wp-content/plugins via a symlink). I tried with all files out of IA (One Zone Standard) but it still wouldn't work. This, I think, is an example of the bottleneck we are seeing when the server comes under load. I ended up moving the plugins folder to a local EBS mount, and this is now working fine as long as we don't get hit with higher traffic loads.
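For reference, the workaround was roughly the following; the paths are made up for the example: copy the shared plugins folder back onto local EBS storage and re-point each site's symlink at it.

# Hypothetical paths: copy plugins back to local EBS, then re-point a site's symlink
rsync -a /mnt/sitefiles/shared/plugins/ /var/www/shared/plugins/
ln -sfn /var/www/shared/plugins /var/www/example-site/wp-content/plugins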
During medium/high traffic periods, the load average spikes above 700 (on a 96-core system) while overall CPU usage stays flat at 30-40%. On the EBS volume, our CPU usage ranged from 30-70% depending on traffic. While the load spikes, PHP-FPM workers shoot way up and sit there in D status (uninterruptible sleep, typically waiting on I/O), which counts toward the load average even though the CPUs are not busy. This causes overall slowdowns for our sites. Increasing workers for Apache or PHP does not seem to change the CPU usage.
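If it's useful to anyone looking at the same symptom, the kernel wait channel (wchan) shows what a D-state process is actually blocked on, which helps distinguish NFS waits from CPU contention. A quick sketch:

# List D-state processes with their kernel wait channel; an nfs/rpc function here
# means the worker is blocked on the NFS client, not waiting for CPU time
ps -eo pid,stat,wchan:32,cmd | awk '$2 ~ /^D/'

# Kernel stack of one stuck worker (needs root; replace <pid>)
sudo cat /proc/<pid>/stack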
Troubleshooting
ps -ax | grep php | grep -c D
is showing high numbers under medium/high loads. When this goes up, sites get slow.

nfsiostat
is showing everything in low ms. I am noticing that nfsiostat doesn't seem to change much no matter what the load is on the server. (UPDATE -- it seems to be changing more than I thought; the second image is under load.) There is a per-mount stats sketch after the mount command below.

ulimit
was raised to the max. This did seem to help a bit. I also played around with settings for somaxconn and tcp_max_syn_backlog without any obvious effect.

The EFS volume is mounted with:
mount -t nfs4 -o nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2,noresvport 172.00.00.1:/ /mnt/sitefiles
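As mentioned in the nfsiostat note above, a couple of ways to get more detail than the summary numbers, using the mount point from the mount command:

# Per-op RTT, retransmits, and queue time for the EFS mount, sampled under load
nfsiostat 5 3 /mnt/sitefiles

# Raw per-mount RPC counters behind those numbers
grep -A 30 'mounted on /mnt/sitefiles' /proc/self/mountstats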
Apache
# module name assumed; adjust to match the MPM in use (worker/event)
<IfModule mpm_event_module>
ServerLimit 4000
StartServers 21
MinSpareThreads 400
MaxSpareThreads 1024
ThreadsPerChild 200
MaxRequestWorkers 7000
MaxConnectionsPerChild 0
</IfModule>
PHP-FPM
pm.max_children = 2000
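Not part of the original setup, but one thing worth checking with a pool this size is whether requests are piling up in FPM's own listen queue rather than on the NFS mount. PHP-FPM's built-in status page exposes this; a sketch, assuming the pool config lives in /etc/php-fpm.d/www.conf and the web server is set up to pass /fpm-status through to the pool:

; in /etc/php-fpm.d/www.conf (path and pool name assumed)
pm.status_path = /fpm-status

# then, after restricting the URL to localhost in the web server config:
curl -s http://127.0.0.1/fpm-status | egrep 'listen queue|active processes'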
Based on the flat CPU usage and the flat nfsiostat numbers, my gut says we are hitting a default network/system limit somewhere. I've been unable to locate what this could be. If anyone has any advice on what to look at, please let me know. Any input would be greatly appreciated!
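In case it helps anyone hunting a similar invisible limit: the kernel keeps counters for the most common default bottlenecks (listen backlog overflows and TCP receive-buffer pressure), and watching whether they climb under load narrows things down quickly. A sketch:

# Listen backlog overflows (related to somaxconn / tcp_max_syn_backlog)
netstat -s | egrep -i 'overflowed|SYNs to LISTEN'

# Receive-buffer pressure: rising "pruned"/"collapsed" counters suggest the
# default tcp_rmem / rmem_max values are too small for the traffic
netstat -s | egrep -i 'pruned|collapsed'

# Overall socket summary
ss -s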
Upvotes: 1
Views: 843
Reputation: 33
After some research and testing, the following TCP settings appear to have helped get things back to normal.
sudo sysctl -w net.core.rmem_max=2097152
sudo sysctl -w net.core.wmem_max=2097152
sudo sysctl -w net.ipv4.tcp_rmem="4096 87380 2097152"
sudo sysctl -w net.ipv4.tcp_wmem="4096 65536 2097152"
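Note that sysctl -w only changes the running kernel; to keep the values across reboots you can drop them into a sysctl.d file (the file name is arbitrary):

cat <<'EOF' | sudo tee /etc/sysctl.d/99-tcp-buffers.conf
net.core.rmem_max = 2097152
net.core.wmem_max = 2097152
net.ipv4.tcp_rmem = 4096 87380 2097152
net.ipv4.tcp_wmem = 4096 65536 2097152
EOF
sudo sysctl --system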
As mentioned before, I also increased the ulimit, which would be prudent to check if you see this on your server:
ulimit -n 1000000 #or the highest number your server can handle
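The shell ulimit doesn't automatically apply to the services; a systemd drop-in is one way to raise the limit for the worker processes themselves (the service name here is an example and may differ, e.g. apache2 or php8.x-fpm):

sudo mkdir -p /etc/systemd/system/php-fpm.service.d
printf '[Service]\nLimitNOFILE=1000000\n' | sudo tee /etc/systemd/system/php-fpm.service.d/limits.conf
sudo systemctl daemon-reload
sudo systemctl restart php-fpm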
The bottleneck appears to have been related to the increased network activity from serving the site files over NFS/EFS.
Upvotes: 1