Reputation: 28198
The disk on one of our test deploy servers went full. It normally runs several Docker containers, several of which communicate using RabbitMQ (also running as a container, image rabbitmq:3.8-management).
After running docker-compose down, around 3 GB of disk space was freed, and docker-compose up -d started everything successfully.
When looking into why the disk was full, the largest contributor turned out to be /var/lib/docker/volumes, which uses 63 GB of disk space.
root@...:/var/lib/docker/volumes# du -s * | sort -nr | head -20
10937880 ca7780bbddd7ce3e4cb9074fc9cad60b9c3b0bf184436ac8f0545fe513e09b31
777792 ff949d7a36874231e2cc9184efbd6a5dcbad1c5931dc97aa03f2353c07af662a
526308 ffc6a025921c6cd32336f402e5f6fc50c1e8bf1ff8c3ed0a145fbcc24d778954
524972 921e79b99d80c040968da6abd906125026d133b5ad790c455bf5912966586de6
518308 c4796d0a7866de9cdc2ae752217e17a20ff4670fdb5dc2538b11be3b9f2bc20a
517352 ef0667fbe6b3049f390ed9a32480515db1a6491d31ba8021cba351851e4177cd
516616 7a230b2ade824f92905c52b5e7ed7b29aafe685de4e5334fe893139e1266db81
515968 c81dbd0a64ae5b120a643d564e29a67a65eb0a8535c986c36872c0999ecc4d6c
514108 283b55310eba29bfa197d2cdacfb4198188df4e6a43e02ee4f0bba91a26b58e6
512200 3e761db0bec2102b1fb76a4fcfddca1ab5473d535c8060d95ac266901073db36
510316 ceef371f07df37fffe36297e13645bc45a934d7cace9008427b4320f95cbf757
509024 390ca3483aa0d7a7a78f83825fb3c6949e66f60aba40daa5dec224692e790a2f
503556 9b450cf79f63b65a1b0b5f91116c20b5fd2759c102c20df1740020598c24edcd
501164 cbdd8ef3500c5ddd3cca27a6037e3851f8698fb6b2df7b1a16c4272475101f1d
500784 e84d84ece86c2966649141f2eb362f66c9e29e33eb1eebdd1da8e59e8ab85b6e
500312 27935ab135130f336d4947c8a450213ea6ca18db0eab76ee46ed35cff7491e0e
491584 619edf2f5bf74795df75581ed8ed79f8c117363ffe2272f20dfc9c460345947a
491440 f6e6eaba7f584796c548c3313993f9b7999c8ad29389ac7577ebb4f24d53c8ce
488288 2f4a8295042d1abcb19cfc8f81d12ae1a0aba297c0ac3161d001c1bfb48e0ee2
488216 b73a38a4250b3e75c019b43f67afa78f2f101789b0696985f9029eccec763f0b
root@...:/var/lib/docker/volumes#
Ignoring the two largest volumes, the vast majority of the remaining disk space is consumed by various 00000001.wal files:
root@...:/var/lib/docker/volumes# find . -name '*.wal' -print0 | xargs -0 ls -lSh | head -20
-rw-r--r-- 1 systemd-coredump systemd-coredump 509M Apr 27 11:55 ./ffc6a025921c6cd32336f402e5f6fc50c1e8bf1ff8c3ed0a145fbcc24d778954/_data/mnesia/rabbit@2ca2f1b7d03e/quorum/rabbit@2ca2f1b7d03e/00000001.wal
-rw-r--r-- 1 systemd-coredump systemd-coredump 508M May 15 10:44 ./921e79b99d80c040968da6abd906125026d133b5ad790c455bf5912966586de6/_data/mnesia/rabbit@38964fede1af/quorum/rabbit@38964fede1af/00000001.wal
-rw-r--r-- 1 systemd-coredump systemd-coredump 501M May 15 09:18 ./c4796d0a7866de9cdc2ae752217e17a20ff4670fdb5dc2538b11be3b9f2bc20a/_data/mnesia/rabbit@6d158cb91613/quorum/rabbit@6d158cb91613/00000001.wal
-rw-r--r-- 1 systemd-coredump systemd-coredump 500M Apr 15 07:42 ./ef0667fbe6b3049f390ed9a32480515db1a6491d31ba8021cba351851e4177cd/_data/mnesia/rabbit@d77cbea95f3f/quorum/rabbit@d77cbea95f3f/00000001.wal
-rw-r--r-- 1 systemd-coredump systemd-coredump 500M May 7 10:23 ./7a230b2ade824f92905c52b5e7ed7b29aafe685de4e5334fe893139e1266db81/_data/mnesia/rabbit@671d1c7262ca/quorum/rabbit@671d1c7262ca/00000001.wal
-rw-r--r-- 1 systemd-coredump systemd-coredump 499M Mar 17 13:06 ./c81dbd0a64ae5b120a643d564e29a67a65eb0a8535c986c36872c0999ecc4d6c/_data/mnesia/rabbit@f18ea19f02d9/quorum/rabbit@f18ea19f02d9/00000001.wal
-rw-r--r-- 1 systemd-coredump systemd-coredump 497M May 15 11:32 ./283b55310eba29bfa197d2cdacfb4198188df4e6a43e02ee4f0bba91a26b58e6/_data/mnesia/rabbit@b316567bc9e5/quorum/rabbit@b316567bc9e5/00000001.wal
-rw-r--r-- 1 systemd-coredump systemd-coredump 493M Mar 20 08:42 ./ceef371f07df37fffe36297e13645bc45a934d7cace9008427b4320f95cbf757/_data/mnesia/rabbit@7263a2c8e417/quorum/rabbit@7263a2c8e417/00000001.wal
-rw-r--r-- 1 systemd-coredump systemd-coredump 492M Apr 2 10:25 ./390ca3483aa0d7a7a78f83825fb3c6949e66f60aba40daa5dec224692e790a2f/_data/mnesia/rabbit@6c4c9395cd3a/quorum/rabbit@6c4c9395cd3a/00000001.wal
-rw-r--r-- 1 systemd-coredump systemd-coredump 487M Mar 12 10:18 ./9b450cf79f63b65a1b0b5f91116c20b5fd2759c102c20df1740020598c24edcd/_data/mnesia/rabbit@105041bb2ba4/quorum/rabbit@105041bb2ba4/00000001.wal
-rw-r--r-- 1 systemd-coredump systemd-coredump 484M May 7 11:14 ./cbdd8ef3500c5ddd3cca27a6037e3851f8698fb6b2df7b1a16c4272475101f1d/_data/mnesia/rabbit@27161b0d4c37/quorum/rabbit@27161b0d4c37/00000001.wal
-rw-r--r-- 1 systemd-coredump systemd-coredump 484M May 19 07:07 ./e84d84ece86c2966649141f2eb362f66c9e29e33eb1eebdd1da8e59e8ab85b6e/_data/mnesia/rabbit@76d7867669d3/quorum/rabbit@76d7867669d3/00000001.wal
-rw-r--r-- 1 systemd-coredump systemd-coredump 484M Apr 20 08:19 ./27935ab135130f336d4947c8a450213ea6ca18db0eab76ee46ed35cff7491e0e/_data/mnesia/rabbit@c4292b29d07d/quorum/rabbit@c4292b29d07d/00000001.wal
-rw-r--r-- 1 systemd-coredump systemd-coredump 475M Mar 30 00:28 ./619edf2f5bf74795df75581ed8ed79f8c117363ffe2272f20dfc9c460345947a/_data/mnesia/rabbit@df35334f453b/quorum/rabbit@df35334f453b/00000001.wal
-rw-r--r-- 1 systemd-coredump systemd-coredump 475M Mar 13 01:00 ./f6e6eaba7f584796c548c3313993f9b7999c8ad29389ac7577ebb4f24d53c8ce/_data/mnesia/rabbit@6c57a04d748b/quorum/rabbit@6c57a04d748b/00000001.wal
-rw-r--r-- 1 systemd-coredump systemd-coredump 472M May 15 09:07 ./2f4a8295042d1abcb19cfc8f81d12ae1a0aba297c0ac3161d001c1bfb48e0ee2/_data/mnesia/rabbit@b7aae098a287/quorum/rabbit@b7aae098a287/00000001.wal
-rw-r--r-- 1 systemd-coredump systemd-coredump 472M Apr 28 07:13 ./b73a38a4250b3e75c019b43f67afa78f2f101789b0696985f9029eccec763f0b/_data/mnesia/rabbit@32d97a15f0e1/quorum/rabbit@32d97a15f0e1/00000001.wal
-rw-r--r-- 1 systemd-coredump systemd-coredump 470M Mar 31 23:43 ./fee6e9f04050c2761edebc3f2569c6ab3289d07628fc442b2dd6bbb532951fad/_data/mnesia/rabbit@faf5d1fedba8/quorum/rabbit@faf5d1fedba8/00000001.wal
-rw-r--r-- 1 systemd-coredump systemd-coredump 466M Mar 27 08:33 ./9ffbbccf79f10e9202a90759796759b1ebfc2bae856f7cb724f33e67da396eda/_data/mnesia/rabbit@e369b70d7c62/quorum/rabbit@e369b70d7c62/00000001.wal
-rw-r--r-- 1 systemd-coredump systemd-coredump 465M Apr 23 08:23 ./15d310ed0df1006e9ed74782bd6bd12b5e263689269f7d7aa6bc261ff5293d43/_data/mnesia/rabbit@9d17f960a453/quorum/rabbit@9d17f960a453/00000001.wal
xargs: ls: terminated by signal 13
After starting up the containers, the system is idle and there are no messages being pushed, so it does not make sense to have large queue files on disk.
The current node is present in only one volume
root@...:/var/lib/docker/volumes# find . -name 'rabbit@3009672*' -print0 | xargs -0 ls -ld
drwxr-xr-x 4 systemd-coredump systemd-coredump 4096 Jul 27 10:38 ./ff949d7a36874231e2cc9184efbd6a5dcbad1c5931dc97aa03f2353c07af662a/_data/mnesia/rabbit@3009672c1fd1
-rw-r--r-- 1 systemd-coredump systemd-coredump 64 Jul 27 10:35 ./ff949d7a36874231e2cc9184efbd6a5dcbad1c5931dc97aa03f2353c07af662a/_data/mnesia/rabbit@3009672c1fd1-feature_flags
-rw-r--r-- 1 systemd-coredump systemd-coredump 3 Jul 27 10:35 ./ff949d7a36874231e2cc9184efbd6a5dcbad1c5931dc97aa03f2353c07af662a/_data/mnesia/[email protected]
drwxr-xr-x 8 systemd-coredump systemd-coredump 4096 Jul 27 10:35 ./ff949d7a36874231e2cc9184efbd6a5dcbad1c5931dc97aa03f2353c07af662a/_data/mnesia/rabbit@3009672c1fd1-plugins-expand
drwxr-xr-x 2 systemd-coredump systemd-coredump 4096 Jul 27 10:35 ./ff949d7a36874231e2cc9184efbd6a5dcbad1c5931dc97aa03f2353c07af662a/_data/mnesia/rabbit@3009672c1fd1/quorum/rabbit@3009672c1fd1
root@...:/var/lib/docker/volumes#
created today, while the time stamps on the stale ones are quite old. The quorum directory name points to https://www.rabbitmq.com/quorum-queues.html, although all the queues are listed as classic.
So questions: How do I properly clean up these stale files, and how do I avoid having them being created in the first place?
Upvotes: 2
Views: 5417
Reputation: 28198
How do I properly clean up these stale files, and how do I avoid having them being created in the first place?
These files were left there because the hostname of the container by default gets a random value, e.g. when starting with hostname d22b6aee1bbb
then the storage directory /var/lib/docker/volumes/7a1fae4f3751b1df2eb32362443281b6808fa2b216070e18d82cbd02deec099b/_data/mnesia/rabbit@d22b6aee1bbb
is used. The next time docker-compose down + up is run, a new directory is used. This results in high disk usage over time, especially on a CI/CD server where the containers are restarted very frequently.
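To clean up what has already accumulated, one possible approach (a sketch, not part of the original answer; it assumes none of the unreferenced volumes hold data you still need) is to remove dangling volumes while the stack is stopped:

# Stop the stack first so its volumes are no longer in use.
docker-compose down
# List volumes that are not referenced by any container.
docker volume ls -f dangling=true
# Remove unreferenced volumes (prompts for confirmation; review the list above first).
docker volume prune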
The solution to this was to explicitly set the hostname in the docker-compose.yml file so that it becomes constant between starts and stops, e.g.
services:
  rabbitmq-server:
    image: rabbitmq:3.8-management
    logging: # https://docs.docker.com/compose/compose-file/#logging
      driver: "json-file"
      options:
        max-size: "200k"
        max-file: "10"
    networks:
      - assetagent-network
    hostname: rabbit_node_1 # <------- THIS LINE
    restart: "unless-stopped"
    healthcheck:
      # Health check test source: https://github.com/docker-library/healthcheck/blob/master/rabbitmq/docker-healthcheck
      ...
    # Usually a good idea to mount this volume explicitly.
    volumes:
      - /some/where/rabbitmq:/var/lib/rabbitmq
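With a fixed hostname, a down/up cycle reuses the same node directory, so no new rabbit@<random> directories pile up. A quick check after restarting (the host path is assumed from the example mount above):

ls /some/where/rabbitmq/mnesia
# expected: a single rabbit@rabbit_node_1 directory plus its
# -feature_flags and -plugins-expand siblings, instead of one per restart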
Upvotes: 7
Reputation: 786
Those files are not quorum queue files; they are Raft WAL segment files. Other features in RabbitMQ also use Raft, most notably the MQTT client ID tracker, which uses very small log entries. With the default WAL segment file size of 512 MiB, it can take a long time for the log to reach a truncation point.
Most recent 3.8 releases allow you to use a smaller WAL segment file size if you don't use quorum queues [1]. More WAL rollover settings relevant to MQTT-only users are coming in a future version.
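As a sketch of what that could look like in rabbitmq.conf (the key name raft.wal_max_size_bytes is taken from the quorum queue documentation; verify it against your RabbitMQ version before relying on it):

# rabbitmq.conf
# Lower the Raft WAL segment size from the 512 MiB default so the log
# reaches a truncation point much sooner when entries are small.
raft.wal_max_size_bytes = 64000000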
Upvotes: 2