Hassan Baig

Reputation: 15824

Diagnosing unexpected redis-server failure

One of my Redis servers has been going down repeatedly today without any obvious, diagnosable cause. My users all end up getting "Error 111 connecting to unix socket: /var/run/redis/redis2.sock. Connection refused." errors.

Looking into the logs at /var/log/redis, the last few lines capture nothing more nefarious than a scheduled backup:

[8248] 09 Mar 07:48:17.090 * 10 changes in 21600 seconds. Saving...
[8248] 09 Mar 07:48:17.374 * Background saving started by pid 47613
[47613] 09 Mar 07:51:02.257 * DB saved on disk
[47613] 09 Mar 07:51:02.486 * RDB: 526 MB of memory used by copy-on-write
[8248] 09 Mar 07:51:02.920 * Background saving terminated with success

The pid file still exists too. Does that imply the server wasn't formally shut down, and that redis was still daemonized?

I logged into my system and had to run sudo service redis-server restart twice to get it up and running again. Apart from these logs, how else can I diagnose what might have gone wrong?


Update: I noticed that at the time of the first crash, disk swapping started taking place. This hasn't happened before. Moreover, cat /proc/sys/vm/swappiness confirms swappiness is set to 2.

free -m shows (after normal operation):

             total       used       free     shared    buffers     cached
Mem:         28136      27015       1120        305         80       6586
-/+ buffers/cache:      20349       7787
Swap:         1023        991         32

free -m shows (after the redis server goes down):

             total       used       free     shared    buffers     cached
Mem:         28136       8770      19365        305         60        441
-/+ buffers/cache:       8268      19868
Swap:         1023       1022          1

Upvotes: 0

Views: 2027

Answers (1)

Itamar Haber

Reputation: 49942

This sounds like the work of the OS's OOM killer. You can verify or rule out that hypothesis by reviewing /var/log/syslog.
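As a rough sketch of that check, the Python below filters a log for lines that look like OOM-killer activity. The matched substrings ("oom-killer", "Out of memory", "Killed process") are the ones the Linux kernel commonly emits, but the exact wording varies by kernel version, so treat the pattern as an assumption. The names find_oom_lines and scan_log are made up for illustration.

```python
import re

# Substrings the Linux kernel commonly logs when the OOM killer fires.
# Assumption: exact wording differs between kernel versions.
OOM_PATTERN = re.compile(r"oom-killer|out of memory|killed process", re.IGNORECASE)


def find_oom_lines(lines):
    """Return the log lines that look like OOM-killer activity."""
    return [line.rstrip("\n") for line in lines if OOM_PATTERN.search(line)]


def scan_log(path="/var/log/syslog"):
    """Scan a log file (usually needs root or adm group access)."""
    with open(path, errors="replace") as fh:
        return find_oom_lines(fh)
```

Calling scan_log() and looking for hits that mention redis-server around the time of the crash (07:51 in the question's log) should confirm or discredit the hypothesis. Rotated logs such as /var/log/syslog.1 may need checking as well.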

If that is the case, the persistence job's memory overhead is what triggered the killer. You need to provision for that by setting maxmemory and allocating enough RAM to accommodate persistence's requirements, including copy-on-write (COW).
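To make the provisioning arithmetic concrete, here is a back-of-the-envelope sketch using figures from the question: 28136 MB of total RAM, 8268 MB still in use with this Redis instance down (the second free output), and 526 MB of COW reported by the last save. The helper name and the 2x safety factor on COW are assumptions for illustration, not a Redis recommendation. COW overhead grows with the write rate during a save and can approach the full dataset size in the worst case.

```python
def suggest_maxmemory_mb(total_ram_mb, other_usage_mb, cow_overhead_mb,
                         cow_safety_factor=2.0):
    """Estimate a maxmemory value (in MB) that leaves room for COW.

    cow_safety_factor is an arbitrary illustrative margin, not a Redis default.
    """
    cow_budget_mb = cow_overhead_mb * cow_safety_factor
    budget_mb = int(total_ram_mb - other_usage_mb - cow_budget_mb)
    if budget_mb <= 0:
        raise ValueError("no memory left for Redis under these assumptions")
    return budget_mb


# Figures taken from the question's free output and Redis log.
limit_mb = suggest_maxmemory_mb(28136, 8268, 526)
print(f"maxmemory {limit_mb}mb")  # prints "maxmemory 18816mb"
```

The printed line has the shape of a redis.conf directive. Since maxmemory does not account for allocator fragmentation or for the forked child's COW pages, a more conservative figure than this estimate may be appropriate.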

Note that free isn't useful after the fact - you need to monitor your resources continuously.
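A dedicated monitoring tool is the better long-term option, but as a minimal sketch of continuous sampling, the Python below reads /proc/meminfo at a fixed interval and appends a timestamped line to a file. The function names, the 10-second interval and the meminfo.log output path are all hypothetical choices.

```python
import time


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: value} (mostly kB)."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def sample_line(info, now=None):
    """Format one timestamped sample with free RAM and free swap in MB."""
    ts = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(now))
    free_mb = info.get("MemFree", 0) // 1024
    swap_free_mb = info.get("SwapFree", 0) // 1024
    return f"{ts} mem_free_mb={free_mb} swap_free_mb={swap_free_mb}"


def monitor(interval_s=10, path="/proc/meminfo", out_path="meminfo.log"):
    """Append a sample every interval_s seconds until interrupted."""
    while True:
        with open(path) as fh:
            info = parse_meminfo(fh.read())
        with open(out_path, "a") as out:
            out.write(sample_line(info) + "\n")
        time.sleep(interval_s)
```

With monitor() running in the background, the samples leading up to the next crash show whether free memory and swap were exhausted just before the server went down, which a one-off free -m after the fact cannot.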

As for swap, if you don't care about latency then you can certainly rely on it.

Upvotes: 3
