Hazelcast 3.12.13 - Random disconnects in a 5 node cluster on GCP VMs

Question

We have a 5-node cluster using TCP/IP clustering, running on GCP VMs. Occasionally (about once a month), all the nodes experience disconnects. It always starts with a particular node, with log lines like:

2024-03-06T21:03:16Z {level="INFO"} hz._hzInstance_1_Prod.IO.thread-in-0 com.hazelcast.logging.LoggingServiceImpl$DefaultLogger.log (LoggingServiceImpl.java:168) [10.137.1.127]:5701 [Prod] [3.12.13] Initialized new cluster connection between /10.137.1.127:5702 and /10.137.1.127:5702

Followed by

2024-03-06T21:03:16Z {level="INFO"} hz._hzInstance_1_Prod.priority-generic-operation.thread-0 com.hazelcast.logging.LoggingServiceImpl$DefaultLogger.log (LoggingServiceImpl.java:168) [10.137.1.127]:5701 [Prod] [3.12.13] Removing Member [10.137.1.127]:5701 - ce37f22c-12d1-431b-907e-80a6a093c843 this
2024-03-06T21:03:16Z {level="INFO"} hz._hzInstance_1_Prod.priority-generic-operation.thread-0 com.hazelcast.logging.LoggingServiceImpl$DefaultLogger.log (LoggingServiceImpl.java:168) [10.137.1.127]:5701 [Prod] [3.12.13] Connection[id=6, /10.137.1.127:5702->/10.137.1.127:5702, qualifier=null, endpoint=[10.137.1.127]:5702, alive=false, type=MEMBER] closed. Reason: Removing Member [10.137.1.127]:5701 - ce37f22c-12d1-431b-907e-80a6a093c843 this, since it thinks it's already split from this cluster and looking to merge.
2024-03-06T21:03:16Z {level="INFO"} hz._hzInstance_1_Prod.priority-generic-operation.thread-0 com.hazelcast.logging.LoggingServiceImpl$DefaultLogger.log (LoggingServiceImpl.java:168) [10.137.1.127]:5701 [Prod] [3.12.13] Removing null, since it thinks it's already split from this cluster and looking to merge.

This is followed by connection-closed messages from all the nodes:

2024-03-06T21:03:16Z {level="INFO"} hz._hzInstance_1_Prod.IO.thread-in-2 com.hazelcast.logging.LoggingServiceImpl$DefaultLogger.log (LoggingServiceImpl.java:168) [10.137.1.127]:5701 [Prod] [3.12.13] Connection[id=4, /10.137.1.127:5701->/10.137.1.82:3358, qualifier=null, endpoint=[10.137.1.82]:5701, alive=false, type=MEMBER] closed. Reason: Connection closed by the other side

All nodes eventually disconnect from each other.

We looked at the nodes' network logs, memory usage, etc., but don't see anything unusual. Any thoughts on what we should be looking for when debugging? Eventually, after restarting the nodes, everything seems to work OK. Hazelcast is configured on port 5701, so why does the node attempt to connect to itself on a different port (5702)?
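One way to narrow this down may be to scan each member's log for connections whose two endpoints are on the same host, as in the lines above. The following is a minimal sketch that assumes the log format shown in this post; the function name `find_self_connections` and the regular expression are illustrative and not part of Hazelcast.

```python
import re

# Matches the endpoint pair in lines such as
#   "Connection[id=6, /10.137.1.127:5702->/10.137.1.127:5702, ..."
#   "... cluster connection between /10.137.1.127:5702 and /10.137.1.127:5702"
# Assumption: IPv4 addresses, formatted as in the log excerpts in this post.
ENDPOINT_PAIR = re.compile(
    r"/(\d{1,3}(?:\.\d{1,3}){3}):(\d+)\s*(?:->|and)\s*/(\d{1,3}(?:\.\d{1,3}){3}):(\d+)"
)


def find_self_connections(lines):
    """Return (timestamp, host, local_port, remote_port) for same-host connections."""
    hits = []
    for line in lines:
        match = ENDPOINT_PAIR.search(line)
        if match is None:
            continue  # line does not describe a connection
        local_host, local_port, remote_host, remote_port = match.groups()
        if local_host == remote_host:
            # The timestamp is the first space-separated field of each line.
            timestamp = line.split(" ", 1)[0]
            hits.append((timestamp, local_host, int(local_port), int(remote_port)))
    return hits
```

Passing the lines of a member's log file to `find_self_connections` returns one tuple per same-host connection, which makes it easier to compare the timing of these events with the "Removing Member" messages across occurrences.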

Any pointers on what to look at next will be appreciated.

Answers (0)
