Reputation: 15824
One of my redis servers is repeatedly going down today without any overt, diagnosable cause. My users all end up getting Error 111 connecting to unix socket: /var/run/redis/redis2.sock. Connection refused errors.
Looking into the logs at /var/log/redis, the last few lines capture nothing more nefarious than a scheduled backup:
[8248] 09 Mar 07:48:17.090 * 10 changes in 21600 seconds. Saving...
[8248] 09 Mar 07:48:17.374 * Background saving started by pid 47613
[47613] 09 Mar 07:51:02.257 * DB saved on disk
[47613] 09 Mar 07:51:02.486 * RDB: 526 MB of memory used by copy-on-write
[8248] 09 Mar 07:51:02.920 * Background saving terminated with success
The pid file still exists too, which implies the server wasn't formally shut down and that redis was still daemonized?
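(A quick sanity check here is to compare the pid recorded in that file against the process table. The path below is only an assumption based on the socket name; use whatever pidfile your redis2 config actually specifies.)
cat /var/run/redis/redis2.pid
ps -p "$(cat /var/run/redis/redis2.pid)" -o pid,cmd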
I logged into my system and ran sudo service redis-server restart twice to get it up and running again. Apart from these logs, how else can I diagnose what might have gone wrong?
Update: I noticed that at the time of the first crash, disk swapping started taking place. This hasn't happened before. Moreover, cat /proc/sys/vm/swappiness confirms swappiness is set to 2.
free -m shows (after normal operation):
             total       used       free     shared    buffers     cached
Mem:         28136      27015       1120        305         80       6586
-/+ buffers/cache:      20349       7787
Swap:         1023        991         32
free -m shows (after the redis server goes down):
             total       used       free     shared    buffers     cached
Mem:         28136       8770      19365        305         60        441
-/+ buffers/cache:       8268      19868
Swap:         1023       1022          1
Upvotes: 0
Views: 2027
Reputation: 49942
This sounds like the work of the OS's OOM killer - you can verify or rule out that hypothesis by reviewing /var/log/syslog.
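For example, grepping the syslog and the kernel ring buffer for OOM-killer activity would look roughly like this (log locations vary by distribution):
grep -i -e 'oom' -e 'out of memory' /var/log/syslog
dmesg | grep -iE 'oom|killed process'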
In this case, the persistence job's overhead triggered the killer. You need to provision for that by setting maxmemory and allocating enough RAM to accommodate persistence's requirements, including copy-on-write.
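As a rough sketch, that could mean something like the following in redis.conf, or the equivalent CONFIG SET at runtime. The 16gb ceiling is purely illustrative - pick a limit that leaves headroom for the BGSAVE fork's copy-on-write pages on your 28 GB box.
# redis.conf
maxmemory 16gb
maxmemory-policy noeviction
# or apply it at runtime without restarting
redis-cli -s /var/run/redis/redis2.sock config set maxmemory 16gb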
Note that free isn't useful after the fact - you need to monitor your resources continuously.
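If you have no dedicated monitoring in place, even periodic sampling of system memory and Redis's own accounting helps, e.g.:
# sample system memory and swap every 60 seconds
vmstat -S M 60
# ask redis itself how much memory it thinks it is using
redis-cli -s /var/run/redis/redis2.sock info memory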
As for swap, if you don't care about latency then you can certainly rely on it.
Upvotes: 3