Reputation: 33
Any tips for troubleshooting would be appreciated.
We are running a large server with multiple sites on a single m6a.24xlarge instance. We had a 15TB EBS volume with all of our websites on it. As we were quickly approaching the EBS volume size limit, we decided to switch to EFS, with the long-term goal of adding load balancing. Before implementing this, we put one of our largest customers on the EFS drive and there were no performance issues.
Slowly, I began to transfer sites over to the EFS volume, creating symlinks on the EBS volume that point to EFS. While transferring, I set the cold access (Infrequent Access, IA) transition to 1 day to reduce overall storage costs during the transition phase. Once the initial transfer was completed, I performed a delta transfer and switched each site over one at a time, with IA set back to 30 days.
Everything slowed down greatly once we reached the last 25% of the sites. I thought that maybe it was data being read back out of cold storage (Infrequent Access). Performance did improve as data moved out of IA, but we are still seeing issues 2 weeks later, and the issue below leads me to believe we are hitting a bottleneck I can't locate.
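For anyone checking the same thing: the IA transition is controlled by the file system's lifecycle configuration, which can be inspected and changed with the AWS CLI. A minimal sketch, with a placeholder file system ID, that keeps the 30-day IA transition but also moves files back to Standard storage on first access:

# Show the current lifecycle (IA) policy; the file system ID below is a placeholder
aws efs describe-lifecycle-configuration --file-system-id fs-0123456789abcdef0

# Transition to IA after 30 days, and back to Standard on first access,
# so frequently read site files don't keep getting served from IA
aws efs put-lifecycle-configuration \
    --file-system-id fs-0123456789abcdef0 \
    --lifecycle-policies "TransitionToIA=AFTER_30_DAYS" "TransitionToPrimaryStorageClass=AFTER_1_ACCESS"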
When I switched everything to the EFS mount, the server would not work at all with the plugins folder on EFS (all sites use this shared folder for wp-content/plugins via a symlink). I tried with all files out of IA (One Zone Standard) but it still wouldn't work. This, I think, is an example of the bottleneck we are seeing when the server comes under load. I ended up moving the plugins folder to a local EBS mount, and this is now working fine as long as we don't get hit with higher traffic loads.
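For reference, the workaround was roughly the following; the paths are made up for the example: copy the shared plugins folder back onto local EBS storage and re-point each site's symlink at it.

# Hypothetical paths: copy plugins back to local EBS, then re-point a site's symlink
rsync -a /mnt/sitefiles/shared/plugins/ /var/www/shared/plugins/
ln -sfn /var/www/shared/plugins /var/www/example-site/wp-content/plugins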
During medium/high traffic periods, the load average spikes above 700 (on a 96-core system) while overall CPU usage stays flat at 30-40%. On the EBS volume, our CPU usage ranged from 30-70% depending on traffic. While the load spikes, PHP-FPM workers shoot way up and sit there in D status (uninterruptible sleep, typically waiting on I/O), which counts toward the load average even though the CPUs are not busy. This causes overall slowdowns for our sites. Increasing workers for Apache or PHP does not seem to change the CPU usage.
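If it's useful to anyone looking at the same symptom, the kernel wait channel (wchan) shows what a D-state process is actually blocked on, which helps distinguish NFS waits from CPU contention. A quick sketch:

# List D-state processes with their kernel wait channel; an nfs/rpc function here
# means the worker is blocked on the NFS client, not waiting for CPU time
ps -eo pid,stat,wchan:32,cmd | awk '$2 ~ /^D/'

# Kernel stack of one stuck worker (needs root; replace <pid>)
sudo cat /proc/<pid>/stack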
Troubleshooting
ps -ax | grep php | grep -c D
is showing high numbers under medium/high loads. When this goes up, sites get slow.

nfsiostat
is showing everything in low ms. I am noticing that nfsiostat doesn't seem to change much no matter what the load is on the server. (UPDATE -- it seems to be changing more than I thought; the second image is under load.) There is a per-mount stats sketch after the mount command below.

ulimit
was raised to the max. This did seem to help a bit. I also played around with settings for somaxconn and tcp_max_syn_backlog without any obvious effect.

The EFS volume is mounted with:
mount -t nfs4 -o nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2,noresvport 172.00.00.1:/ /mnt/sitefiles
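As mentioned in the nfsiostat note above, a couple of ways to get more detail than the summary numbers, using the mount point from the mount command:

# Per-op RTT, retransmits, and queue time for the EFS mount, sampled under load
nfsiostat 5 3 /mnt/sitefiles

# Raw per-mount RPC counters behind those numbers
grep -A 30 'mounted on /mnt/sitefiles' /proc/self/mountstats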
Apache
# module name assumed; adjust to match the MPM in use (worker/event)
<IfModule mpm_event_module>
ServerLimit 4000
StartServers 21
MinSpareThreads 400
MaxSpareThreads 1024
ThreadsPerChild 200
MaxRequestWorkers 7000
MaxConnectionsPerChild 0
</IfModule>
PHP-FPM
pm.max_children = 2000
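Not part of the original setup, but one thing worth checking with a pool this size is whether requests are piling up in FPM's own listen queue rather than on the NFS mount. PHP-FPM's built-in status page exposes this; a sketch, assuming the pool config lives in /etc/php-fpm.d/www.conf and the web server is set up to pass /fpm-status through to the pool:

; in /etc/php-fpm.d/www.conf (path and pool name assumed)
pm.status_path = /fpm-status

# then, after restricting the URL to localhost in the web server config:
curl -s http://127.0.0.1/fpm-status | egrep 'listen queue|active processes'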
Based on the flat CPU usage and the flat nfsiostat numbers, my gut says we are hitting a default network/system limit somewhere. I've been unable to locate what this could be. If anyone has any advice on what to look at, please let me know. Any input would be greatly appreciated!
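In case it helps anyone hunting a similar invisible limit: the kernel keeps counters for the most common default bottlenecks (listen backlog overflows and TCP receive-buffer pressure), and watching whether they climb under load narrows things down quickly. A sketch:

# Listen backlog overflows (related to somaxconn / tcp_max_syn_backlog)
netstat -s | egrep -i 'overflowed|SYNs to LISTEN'

# Receive-buffer pressure: rising "pruned"/"collapsed" counters suggest the
# default tcp_rmem / rmem_max values are too small for the traffic
netstat -s | egrep -i 'pruned|collapsed'

# Overall socket summary
ss -s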
Upvotes: 1
Views: 843
Reputation: 33
After some research and testing, the following TCP settings appear to have helped get things back to normal.
sudo sysctl -w net.core.rmem_max=2097152
sudo sysctl -w net.core.wmem_max=2097152
sudo sysctl -w net.ipv4.tcp_rmem="4096 87380 2097152"
sudo sysctl -w net.ipv4.tcp_wmem="4096 65536 2097152"
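Note that sysctl -w only changes the running kernel; to keep the values across reboots you can drop them into a sysctl.d file (the file name is arbitrary):

cat <<'EOF' | sudo tee /etc/sysctl.d/99-tcp-buffers.conf
net.core.rmem_max = 2097152
net.core.wmem_max = 2097152
net.ipv4.tcp_rmem = 4096 87380 2097152
net.ipv4.tcp_wmem = 4096 65536 2097152
EOF
sudo sysctl --system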
As mentioned before, I also increased the ulimit, which would be prudent to check if you see this on your server:
ulimit -n 1000000 #or the highest number your server can handle
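The shell ulimit doesn't automatically apply to the services; a systemd drop-in is one way to raise the limit for the worker processes themselves (the service name here is an example and may differ, e.g. apache2 or php8.x-fpm):

sudo mkdir -p /etc/systemd/system/php-fpm.service.d
printf '[Service]\nLimitNOFILE=1000000\n' | sudo tee /etc/systemd/system/php-fpm.service.d/limits.conf
sudo systemctl daemon-reload
sudo systemctl restart php-fpm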
The bottleneck appears to have been related to the increased network activity from serving the site files over NFS/EFS.
Upvotes: 1