Cassandra node JVM hang during node repair a table with materialized view

Question

I have a 9 nodes cluster on AWS. Recently, some nodes were down and I want to repair the cluster after I restarted them. But I found the repair operation causes lots of memtable flush and then the JVM GC failed. Consequently, the node hang.

I am using the cassandra 3.1.0.

java version "1.8.0_231"
Java(TM) SE Runtime Environment (build 1.8.0_231-b32)
Java HotSpot(TM) 64-Bit Server VM (build 25.231-b32, mixed mode)

The node hardware is 32GB mem and 4 cores CPU. The heap is 16GB. For each node, about 200 GB sstables.

The JVM hang is very fast. After the repair process starts, everything works. I checked the memory, cpu and IO. No stress found. After random time (the streaming task is completing), the memtableflushwriter pending task increasing very fast, and then GC failed. The JVM hang and the heapdump created. When the issue happened, the CPU is in a low usage, and I cannot find I/O latency on AWS EBS disk metrics.

I checked the heap dump file. There are several big memtables objects of the table repairing. The memtable objects size is about 400 - 700MB. And the memtables are created in 20 seconds. In addition, I can see more than 12000 memtables. In these memtables, there are 6000 sstable_activity memtables.

At first, I suspect the memtable flush writer is the bottleneck. So I increase it to 4 threads. And I double the memory of the node. But it doesn't work. During repairing, the pending task increasing fast and then the node hang again. I also decrease the repair token range, only one vnode, but still failed.

We can see some logs like this

WARN [STREAM-IN-/10.0.113.12:7000] 2020-04-02 05:05:57,150 BigTableWriter.java:211 - Writing large partition ....

The writing sstables have 300 - 500 MBs. Some big one reaches 2+ GB.

I go through the cassandra source code. And I found the sstables must be processed in normal write process if the table has a materialized view. So I suspect the issue occurs in COMPLETE stage in streaming.

After streaming, the receive callback function loads the updated partition sstables and create mutation as normal writes. So it increases memtables in heap. In addition, it also invoke flush() which will create extra memtables besides repaired tables. The memtables size exceeds the clean up threshold. So flush is called. But flush cannot free enough memories. So many times of flush called. On the other hand, the flush will increase the memtables, too.

So anyone meets the same issue? If my conclusion is correct, how to fix it?

Cassandra node JVM hang during node repair a table with materialized view

Answers (1)

Related Questions