Reputation: 31
I'm currently running an ELK cluster on reasonably weak hardware (four virtual machines, each with 4 GB of memory and two cores assigned). This is slated to change in a couple of months, but for now we still need to ingest logs and make them available.
After getting all of the servers for one service sending their logs to Logstash via nxlog, collection worked fairly well for a few days. Shortly after that, Logstash started to wedge frequently. The Logstash thread 'filterworker.0' jumps to 93% and then 99% of the server's CPU. Logstash itself doesn't terminate; it just continues on, hung, never sending any fresh logs to Elasticsearch. Debug logs show that Logstash is continually flushing by interval. It never recovers from this state; it ran an entire weekend hung and only resumed normal operations when I restarted it. After a restart, Logstash would start catching up on the weekend's logs and then quickly freeze again (usually within five to ten minutes), requiring another restart of the service. Once the logs had mostly caught up (many restarts later, and after turning off some of the complicated grok filters), Logstash returned to its previous habit of wedging every five to thirty minutes.
I attempted to narrow this down to a particular configuration by swapping my log filters into and out of the conf.d directory. With fewer configs, Logstash would run for longer (up to an hour and a half), but eventually it would freeze again.
Attaching jstack to the PID of the frozen filterworker.0 thread returned mostly 'get_thread_regs failed for a lwp' debugger exceptions, and no deadlocks were found.
There are no actual failures in Logstash's logs when run at debug verbosity, just those flush-by-interval log lines.
The disks are not full.
Our current configuration is three elasticsearch nodes, all receiving input from the logstash server (using logstash's internal load balancer). We have a single logstash server. These are all CentOS 7 machines. The logstash machine is running version 2.1.3, sourced from Elastic's yum repository.
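For reference, the output side of my Logstash config looks roughly like this (host names are placeholders, not the real ones):

```
output {
  elasticsearch {
    # the elasticsearch output spreads requests across all listed hosts
    hosts => ["es-node1:9200", "es-node2:9200", "es-node3:9200"]
  }
}
```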
I've played around with changing the heap size, but nothing appears to help, so I'm currently running with the out-of-the-box defaults. We only use one filter worker thread since it's a single-core virtual machine. We used to use the multiline filter, but that was the first thing I commented out when this started happening.
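For what it's worth, the knobs I've been adjusting are the ones in /etc/sysconfig/logstash; the values below are only examples of what I've tried, not what I'm running now:

```
# /etc/sysconfig/logstash (CentOS 7, Logstash 2.x RPM) -- illustrative values only
LS_HEAP_SIZE="1g"   # JVM heap for Logstash; tried a few sizes, now back on the default
LS_OPTS="-w 1"      # filter workers; kept at 1 on this VM
```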
I'm not sure where to go next. My theory is that Logstash's buffer simply can't keep up with the current log traffic, but without any conclusive errors in the logs I'm not sure how to prove it. It feels like it might be worth putting a Redis or RabbitMQ queue between nxlog and Logstash to absorb the flood; does that seem like a reasonable next step?
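Concretely, what I have in mind is splitting Logstash into a thin shipper that just pushes events into Redis and an indexer that does the heavy grok work, something like the sketch below (the host names, port, and key are placeholders, and the input block stands in for however nxlog currently ships to us):

```
# shipper.conf -- receives from nxlog, no filters, pushes straight into Redis
input {
  tcp {
    port  => 5140          # placeholder for the existing nxlog -> Logstash input
    codec => json_lines
  }
}
output {
  redis {
    host      => "redis-broker"
    data_type => "list"
    key       => "logstash"
  }
}

# indexer.conf -- pulls from Redis at its own pace, runs the filters, ships to Elasticsearch
input {
  redis {
    host      => "redis-broker"
    data_type => "list"
    key       => "logstash"
  }
}
filter {
  # the existing grok (and formerly multiline) filters would move here unchanged
}
output {
  elasticsearch {
    hosts => ["es-node1:9200", "es-node2:9200", "es-node3:9200"]
  }
}
```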
Any suggestions that people might have would be greatly appreciated.
Upvotes: 0
Views: 2111
Reputation: 2265
It sounds like you need more Logstash nodes. We experienced similar outages, caused by CPU, when log throughput went up for various reasons. For reference, we push approximately 6K lines per second and run 6 Logstash nodes.
Also, putting a Redis pipeline in front of the Logstash nodes let us configure them to pull and process at their own pace. With Redis absorbing the brunt of the traffic, the Logstash nodes are effectively over-provisioned now: they pull log entries as they are ready, and their utilization is much more consistent (no more crashing).
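As a rough sketch, the indexer side of that is just a Redis list input that every Logstash node shares; each entry is popped by exactly one node, so adding nodes spreads the load. The host, key, and tuning values here are only examples, not our production settings:

```
input {
  redis {
    host        => "redis-broker"
    data_type   => "list"
    key         => "logstash"
    # optional consumer-side tuning -- benchmark these rather than copying them
    threads     => 2
    batch_count => 100
  }
}
```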
Upvotes: 0
Reputation: 86
I use monit to watch the service, check for high CPU usage, and restart Logstash when it gets pegged. It's a bit of a workaround, not really a long-term solution. A queuing system would probably do the trick; check out Kafka, Redis, or RabbitMQ. You would need to measure the rate at which the queue is written to versus the rate at which it is read.
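For reference, the monit rule is a simple CPU check on the Logstash process, along the lines of the sketch below; the pidfile path, commands, and thresholds are specific to each setup, so treat them as placeholders:

```
# /etc/monit.d/logstash -- restart Logstash when it sits pegged on CPU
check process logstash with pidfile /var/run/logstash.pid
  start program = "/bin/systemctl start logstash"
  stop program  = "/bin/systemctl stop logstash"
  # five consecutive polling cycles above 90% CPU counts as "wedged"
  if cpu > 90% for 5 cycles then restart
```

For the write-versus-read rate, sampling the queue depth over time (for example `redis-cli llen <key>` every few seconds, or the equivalent depth metric in Kafka or RabbitMQ) is usually enough to show whether the consumers are keeping up.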
Upvotes: 0
Reputation: 66
You could try resetting the Java environment. When I start up my Logstash, CPU usage climbs to 99%, but once the JVM has restarted, it drops to about 3%. So my guess is that something may be wrong with your Java environment. Hope this helps.
Upvotes: 0