Gnosis

Reputation: 33

Apache / PHP-FPM Degraded Performance after switch to EFS from EBS

Any tips for troubleshooting would be appreciated.

Background

We are running a large server (m6a.24xlarge) hosting multiple sites. We had a 15 TB EBS volume with all of our websites on it. As we were quickly approaching the volume size limit, we decided to switch to EFS, with the long-term goal of adding load balancing. Before implementing this, we put one of our largest customers on the EFS drive. There were no performance issues.

Slowly, I began to transfer sites over to the EFS volume, creating symlinks on the EBS volume that point to the EFS volume (as sketched below). While transferring, I set the Infrequent Access (IA) lifecycle policy to 1 day to reduce overall storage costs during the transition. Once the initial transfer was complete, I performed a delta transfer, switched each site over one at a time, and set the IA policy back to 30 days.
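A rough sketch of the per-site cutover described above (the paths and rsync flags are illustrative, not our exact commands):

# Initial bulk copy to EFS while the site keeps serving from EBS
rsync -aHAX /var/www/example.com/ /mnt/sitefiles/example.com/

# Later: delta pass to pick up changes, then swap in the symlink
rsync -aHAX --delete /var/www/example.com/ /mnt/sitefiles/example.com/
mv /var/www/example.com /var/www/example.com.bak
ln -s /mnt/sitefiles/example.com /var/www/example.com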

Everything slowed down greatly once we reached the last 25% of the sites. I initially thought the slowdown was data being transferred out of cold storage (IA). Performance did improve as data moved out of IA, but we are still seeing issues two weeks later, and the issue below leads me to believe we are hitting a bottleneck I can't locate.
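If you want to rule IA in or out, the lifecycle policy can be inspected and changed from the AWS CLI (the file system ID below is a placeholder):

aws efs describe-lifecycle-configuration --file-system-id fs-0123456789abcdef0

# Transition to IA after 30 days, and pull files back to Standard on first access
aws efs put-lifecycle-configuration \
    --file-system-id fs-0123456789abcdef0 \
    --lifecycle-policies '[{"TransitionToIA":"AFTER_30_DAYS"},{"TransitionToPrimaryStorageClass":"AFTER_1_ACCESS"}]'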

When I switched everything to the EFS mount, the server would not work at all with the plugins folder on EFS (all sites use this folder for wp-content/plugins via a symlink). I tried with all files out of IA (One Zone Standard), but it still wouldn't work. I think this is an example of the bottleneck we are seeing whenever the server comes under load. I ended up moving the plugins folder back to a local EBS mount, and it now works fine as long as we don't get hit with higher traffic loads.

Issue

During medium/high-traffic periods, the CPU load average spikes above 700 (on a 96-core system) while overall CPU usage stays flat at 30-40%. On the EBS volume, our CPU usage ranged from 30-70% depending on traffic. While the load spikes, the number of PHP-FPM workers shoots way up, and the workers sit in D status (uninterruptible sleep, typically waiting on I/O); they appear to be what is driving the load average. This causes overall slowdowns for our sites. Increasing workers for Apache or PHP does not change the CPU usage.
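To confirm it is the D-state workers inflating the load average, standard procps tools are enough (the wchan column shows the kernel function each worker is blocked in; NFS waits typically show up there):

# List processes in uninterruptible sleep and where they are blocked
ps -eo pid,stat,wchan:32,comm | awk '$2 ~ /^D/'

# The Linux load average counts D-state processes, which is how you
# get load > 700 alongside only 30-40% CPU usage
uptime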

Troubleshooting

Low traffic (screenshot)

Medium traffic (screenshot)

EFS mount command

mount -t nfs4 -o nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2,noresvport 172.00.00.1:/ /mnt/sitefiles
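To verify the effective options and watch per-operation latency on that mount (high round-trip times on GETATTR/LOOKUP are the classic symptom of metadata-heavy PHP workloads on NFS/EFS):

nfsstat -m                  # effective mount options
nfsiostat 5 /mnt/sitefiles  # per-op latency, refreshed every 5 s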

Apache

<IfModule mpm_event_module>
    ServerLimit              4000
    StartServers              21
    MinSpareThreads           400
    MaxSpareThreads          1024
    ThreadsPerChild           200
    MaxRequestWorkers      7000
    MaxConnectionsPerChild 0
</IfModule>
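For what it's worth, with the worker/event MPM the effective cap on concurrent requests is min(MaxRequestWorkers, ServerLimit × ThreadsPerChild), so with these numbers MaxRequestWorkers itself is the binding limit:

# ServerLimit * ThreadsPerChild = 4000 * 200 = 800000 >> 7000, so
# MaxRequestWorkers (7000) is the cap; Apache only needs
# ceil(7000/200) = 35 child processes to reach it.
echo $((4000 * 200)) $(( (7000 + 199) / 200 ))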

PHP-FPM

pm.max_children  = 2000
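One sanity check worth running on a number that large: average worker RSS × pm.max_children has to fit in RAM (the process name may be php-fpm, php-fpm8.x, etc. depending on the build):

# Average resident memory per PHP-FPM worker, in MB
ps -C php-fpm -o rss= | awk '{sum+=$1; n++} END {printf "avg RSS: %.0f MB across %d workers\n", sum/n/1024, n}'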

Next Steps?

Based on the flat CPU usage and the nfsiostat output, my gut says we are hitting a default network/system limit somewhere, but I've been unable to locate it. If anyone has advice on what to look at, please let me know. Any input would be greatly appreciated!
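For anyone debugging something similar, a few of the default limits worth checking (standard Linux tooling; paths and availability vary by distro and by whether conntrack is loaded):

# Established TCP connections to the NFS endpoint (EFS uses port 2049)
ss -tn '( dport = :2049 )'

# Kernel defaults that commonly bite under high connection counts
sysctl net.core.somaxconn net.core.netdev_max_backlog fs.file-max

# Connection tracking, if netfilter conntrack is in use
cat /proc/sys/net/netfilter/nf_conntrack_count /proc/sys/net/netfilter/nf_conntrack_max 2>/dev/null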

Upvotes: 1

Views: 843

Answers (1)

Gnosis

Reputation: 33

After some research and testing, the following TCP settings appear to have helped get things back to normal.

sudo sysctl -w net.core.rmem_max=2097152
sudo sysctl -w net.core.wmem_max=2097152
sudo sysctl -w net.ipv4.tcp_rmem="4096 87380 2097152"
sudo sysctl -w net.ipv4.tcp_wmem="4096 65536 2097152"
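Note that sysctl -w settings do not survive a reboot; to persist them, drop the same values into a sysctl.d file and reload (the file name is arbitrary):

cat <<'EOF' | sudo tee /etc/sysctl.d/99-efs-tcp.conf
net.core.rmem_max = 2097152
net.core.wmem_max = 2097152
net.ipv4.tcp_rmem = 4096 87380 2097152
net.ipv4.tcp_wmem = 4096 65536 2097152
EOF
sudo sysctl --system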

As mentioned before, I also increased the ulimit (open-file limit), which would also be prudent if you see this on your server:

ulimit -n 1000000 #or the highest number your server can handle
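Keep in mind that ulimit -n only affects the current shell. For Apache and PHP-FPM running under systemd, the limit has to be raised on the units themselves (service names vary by distro, e.g. httpd/apache2 and php-fpm):

sudo systemctl edit php-fpm
# then add:
#   [Service]
#   LimitNOFILE=1000000
sudo systemctl daemon-reload
sudo systemctl restart php-fpm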

The bottleneck appears to have been related to the increased network activity after the move to EFS.

Upvotes: 1
