ruanhao

Reputation: 4922

RabbitMQ application stops when another node in cluster is shutdown

I am new to RabbitMQ and I have troubles when handling RabbitMQ cluster.

The topology is like:

[Topology diagram: two RabbitMQ nodes running as Kubernetes pods, clustered via the autocluster plugin]

At first, everything is OK. RabbitMQ node1 and node2 are in a cluster, interconnected by a RabbitMQ plugin called autocluster.

Then I delete pod rabbitmq-1 with kubectl delete pod rabbitmq-1, and I find that the RabbitMQ application on node1 stops as well. I don't understand why RabbitMQ would stop its application when it detects another node's failure. It does not make sense to me. Is this behaviour designed by RabbitMQ or by autocluster? Can you enlighten me?

My config is like:

[
  {rabbit, [
    {tcp_listen_options, [
                          {backlog,       128},
                          {nodelay,       true},
                          {linger,        {true,0}},
                          {exit_on_close, false},
                          {sndbuf,        12000},
                          {recbuf,        12000}
                         ]},
    {loopback_users, [<<"guest">>]},
    {log_levels,[{autocluster, debug}, {connection, debug}]},
    {cluster_partition_handling, pause_minority},
    {vm_memory_high_watermark, {absolute, "3276MiB"}}
  ]},

  {rabbitmq_management, [
    {load_definitions, "/etc/rabbitmq/rabbitmq-definitions.json"}
  ]},

  {autocluster, [
    {dummy_param_without_comma, true},
    {autocluster_log_level, debug},
    {backend, etcd},
    {autocluster_failure, ignore},
    {cleanup_interval, 30},
    {cluster_cleanup, false},
    {cleanup_warn_only, false},
    {etcd_ttl, 30},
    {etcd_scheme, http},
    {etcd_host, "etcd.kube-system.svc.cluster.local"},
    {etcd_port, 2379}
   ]}
]

In my case, x-ha-policy is enabled.
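For context, queue mirroring in modern RabbitMQ is configured through server-side policies rather than the old x-ha-policy queue argument. A sketch of an equivalent policy (the policy name ha-all is illustrative, not from the question):

```
# Mirror every queue across all cluster nodes
# ("ha-all" is an illustrative policy name; "^" matches all queue names)
rabbitmqctl set_policy ha-all "^" '{"ha-mode":"all"}'
```
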

Upvotes: 1

Views: 1944

Answers (1)

svenwltr

Reputation: 18452

You set cluster_partition_handling to pause_minority. In a two-node cluster, the surviving node is not a majority (one of two is exactly half), so it pauses itself as configured. You either have to add an additional node or set cluster_partition_handling to ignore.

From the docs:

In pause-minority mode RabbitMQ will automatically pause cluster nodes which determine themselves to be in a minority (i.e. fewer or equal than half the total number of nodes) after seeing other nodes go down. It therefore chooses partition tolerance over availability from the CAP theorem. This ensures that in the event of a network partition, at most the nodes in a single partition will continue to run. The minority nodes will pause as soon as a partition starts, and will start again when the partition ends.
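For a two-node cluster, one way to keep the remaining node running (accepting the risk of inconsistency during a real network partition) is to change the partition-handling mode. A minimal sketch in the same classic-config format as the question:

```
[
  {rabbit, [
    %% "ignore" keeps nodes running through a partition;
    %% "autoheal" instead restarts the losing side once the partition heals.
    {cluster_partition_handling, ignore}
  ]}
].
```

With three or more nodes, pause_minority is generally the safer choice, since a true majority can remain available while the minority pauses.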

Upvotes: 2
