LittleLebowski

Reputation: 7941

Site becomes inaccessible due to PHP-FPM listen queue, CPU touches 100%

I've been racking my brain trying to solve an issue that comes up randomly every few hours on my production server, which hosts a single WordPress blog with decent traffic: about 2000 real-time users on average days, 5000+ on good days, and 300 to 700+ pageviews per minute.

I use New Relic to monitor performance and I've noticed a peculiar thing:

Every few hours (at random), the PHP-FPM pool status looks like the following (a real status captured yesterday):

pool:                 www
process manager:      static
start time:           02/Jan/2017:05:03:16 -0500
start since:          27290
accepted conn:        1107594
listen queue:         777
max listen queue:     794
listen queue len:     40000
idle processes:       0
active processes:     100
total processes:      100
max active processes: 101
max children reached: 0
slow requests:        0
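
For readers who want to catch this state automatically, here is a minimal sketch that parses the plain-text status output above and flags the condition it shows: no idle workers while requests wait in the listen queue. The function names `parse_fpm_status` and `is_saturated` are hypothetical helpers, not part of PHP-FPM, and the "saturated" rule is an assumption about what is worth alerting on.

```python
def parse_fpm_status(text):
    """Parse the plain-text PHP-FPM status page into a dict (hypothetical helper)."""
    status = {}
    for line in text.splitlines():
        # Split on the first colon only, so timestamps in the value survive.
        key, sep, value = line.partition(":")
        if not sep:
            continue
        value = value.strip()
        status[key.strip()] = int(value) if value.isdigit() else value
    return status


def is_saturated(status):
    """Assumed rule: every worker is busy and requests are queueing."""
    return status.get("idle processes") == 0 and status.get("listen queue", 0) > 0


# Values copied from the status output in the question.
SAMPLE = """\
pool:                 www
process manager:      static
start time:           02/Jan/2017:05:03:16 -0500
listen queue:         777
max listen queue:     794
idle processes:       0
active processes:     100
total processes:      100
"""

status = parse_fpm_status(SAMPLE)
print(status["listen queue"], is_saturated(status))
```

All 100 workers being active with 777 requests queued suggests the workers are not finishing requests quickly enough, which is the state the answers below try to explain.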

Restarting PHP-FPM and nginx solves the issue, but it happens again within a couple of hours. Any help is appreciated.


Server setup:

DigitalOcean 48GB Memory
16 Core Processor
480GB SSD Disk

PHP-FPM pool setting:

pm = static
pm.max_children = 100
pm.max_requests = 5000

nginx config:

worker_processes  32;
worker_rlimit_nofile 100000;
events {
    worker_connections  40000;
    use epoll;
    multi_accept on;
}

I'm also using XCache and Varnish, with W3TC on WordPress (and Cloudflare in front).

sysctl.conf:

# Increase size of file handles and inode cache
fs.file-max = 2097152

# Do less swapping
vm.swappiness = 10
vm.dirty_ratio = 60
vm.dirty_background_ratio = 2

### GENERAL NETWORK SECURITY OPTIONS ###

# Number of times SYN-ACKs are retransmitted for a passive TCP connection
net.ipv4.tcp_synack_retries = 2

# Allowed local port range
net.ipv4.ip_local_port_range = 2000 65535

# Protect Against TCP Time-Wait
net.ipv4.tcp_rfc1337 = 1

# Decrease the default value of tcp_fin_timeout
net.ipv4.tcp_fin_timeout = 15

# Decrease the default keepalive time for connections
net.ipv4.tcp_keepalive_time = 300
net.ipv4.tcp_keepalive_probes = 5
net.ipv4.tcp_keepalive_intvl = 15

### TUNING NETWORK PERFORMANCE ###

# Default Socket Receive Buffer
net.core.rmem_default = 31457280

# Maximum Socket Receive Buffer
net.core.rmem_max = 12582912

# Default Socket Send Buffer
net.core.wmem_default = 31457280

# Maximum Socket Send Buffer
net.core.wmem_max = 12582912

# Increase number of incoming connections
net.core.somaxconn = 40000

# Increase number of incoming connections backlog
net.core.netdev_max_backlog = 65536

# Increase the maximum amount of option memory buffers
net.core.optmem_max = 25165824

# Increase the maximum total buffer-space allocatable
# This is measured in units of pages (4096 bytes)
net.ipv4.tcp_mem = 65536 131072 262144
net.ipv4.udp_mem = 65536 131072 262144

# Increase the read-buffer space allocatable
net.ipv4.tcp_rmem = 10240 87380 12582912
net.ipv4.udp_rmem_min = 16384

# Increase the write-buffer-space allocatable
net.ipv4.tcp_wmem = 10240 87380 12582912
net.ipv4.udp_wmem_min = 16384

# Increase the tcp-time-wait buckets pool size to prevent simple DOS attacks
net.ipv4.tcp_max_tw_buckets = 1440000
net.ipv4.tcp_tw_recycle = 1
net.ipv4.tcp_tw_reuse = 1

Upvotes: 9

Views: 2107

Answers (2)

Argus Duong

Reputation: 2654

Have you checked your access.log or domain.com.access.log in /var/log/nginx/? It will give you more detail about why PHP-FPM is eating your CPU.

I suspect your site is under a brute-force attack against wp-login.php, which consumes a lot of CPU.
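
One way to test that suspicion is to count login attempts per client in the access log. The sketch below assumes nginx's default "combined" log format; the function name `login_hits_by_ip` is hypothetical and the sample lines are made up for illustration (using documentation-range IPs), not taken from the asker's logs.

```python
import re
from collections import Counter

# Start of nginx's default "combined" format:
# client IP, identity, user, [timestamp], "METHOD path protocol"
LINE_RE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "(\S+) (\S+)[^"]*"')


def login_hits_by_ip(lines, path="/wp-login.php"):
    """Count POST requests to the login page per client IP (hypothetical helper)."""
    hits = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if not match:
            continue  # skip lines that do not look like combined-format entries
        ip, method, url = match.groups()
        # Ignore any query string when comparing the path.
        if method == "POST" and url.split("?", 1)[0] == path:
            hits[ip] += 1
    return hits


# Made-up sample lines for illustration only.
sample_lines = [
    '203.0.113.5 - - [02/Jan/2017:12:30:01 -0500] "POST /wp-login.php HTTP/1.1" 200 1500 "-" "ExampleBot/1.0"',
    '203.0.113.5 - - [02/Jan/2017:12:30:02 -0500] "POST /wp-login.php HTTP/1.1" 200 1500 "-" "ExampleBot/1.0"',
    '198.51.100.7 - - [02/Jan/2017:12:30:03 -0500] "GET /a-post/ HTTP/1.1" 200 9000 "-" "Mozilla/5.0"',
]

hits = login_hits_by_ip(sample_lines)
for ip, count in hits.most_common(10):
    print(ip, count)
```

To run it against a real log, pass an open file object such as `open("/var/log/nginx/access.log")` instead of `sample_lines`. Because Varnish and Cloudflare sit in front of this server, the first field may hold a proxy address rather than the real client, so the per-IP counts are only meaningful if nginx is configured to log the forwarded address.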

Upvotes: 0

user7196449


Try stopping your New Relic agent and waiting a few hours to see if that resolves the issue. If it does, then try upgrading it to the latest version. If the problem comes back once it's upgraded, contact New Relic support.

Check max_execution_time in your php.ini and request_terminate_timeout in your PHP-FPM pool configuration.

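Check the proxy_connect_timeout, proxy_send_timeout, proxy_read_timeout, and send_timeout values in the nginx config as well. If nginx talks to PHP-FPM through fastcgi_pass, the equivalent directives are fastcgi_connect_timeout, fastcgi_send_timeout, and fastcgi_read_timeout.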

I would also recommend checking your TCP/IP settings, as the keep-alive and timeout values there may need to be reduced. I've seen some distros ship with defaults of a minute or more.

You should also verify that the traffic reaching the listener is legitimate. See if you can write samples to a file and check them. Many automated processes seek out WordPress instances on the internet, and these bots can cause all kinds of problems as they try to hack your site.
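
As a rough sketch of that validation, the snippet below buckets access-log lines by minute and tallies user agents, so a burst from a single bot stands out around the time the listen queue fills up. It assumes nginx's default "combined" log format, where the user agent is the last quoted field; the helper name `summarize_traffic` and the sample lines are hypothetical.

```python
import re
from collections import Counter

# Captures the [timestamp] and the last quoted field (the user agent
# in nginx's default "combined" format).
TS_UA_RE = re.compile(r'\[([^\]]+)\].*"([^"]*)"\s*$')


def summarize_traffic(lines):
    """Return (requests per minute, requests per user agent) as Counters."""
    per_minute = Counter()
    per_agent = Counter()
    for line in lines:
        match = TS_UA_RE.search(line)
        if not match:
            continue
        timestamp, agent = match.groups()
        # "02/Jan/2017:12:30:01 -0500" -> "02/Jan/2017:12:30"
        per_minute[timestamp[:17]] += 1
        per_agent[agent] += 1
    return per_minute, per_agent


# Made-up sample lines for illustration only.
sample_lines = [
    '203.0.113.5 - - [02/Jan/2017:12:30:01 -0500] "POST /xmlrpc.php HTTP/1.1" 200 370 "-" "ExampleBot/1.0"',
    '203.0.113.5 - - [02/Jan/2017:12:30:40 -0500] "POST /xmlrpc.php HTTP/1.1" 200 370 "-" "ExampleBot/1.0"',
    '198.51.100.7 - - [02/Jan/2017:12:31:05 -0500] "GET / HTTP/1.1" 200 9000 "-" "Mozilla/5.0"',
]

per_minute, per_agent = summarize_traffic(sample_lines)
print(per_minute.most_common(5))
print(per_agent.most_common(5))
```

Comparing the busiest minutes with the times the site became unreachable is one way to judge whether the stalls line up with bot activity or with ordinary traffic.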

Upvotes: 1
