hellogb

Reputation: 11

Cassandra node JVM hangs during repair of a table with a materialized view

I have a 9-node cluster on AWS. Recently, some nodes went down, and I want to repair the cluster after restarting them. But I found that the repair operation causes lots of memtable flushes, and then JVM GC fails. Consequently, the node hangs.

I am using Cassandra 3.1.0.

java version "1.8.0_231"
Java(TM) SE Runtime Environment (build 1.8.0_231-b32)
Java HotSpot(TM) 64-Bit Server VM (build 25.231-b32, mixed mode)

The node hardware is 32 GB of memory and 4 CPU cores. The heap is 16 GB. Each node holds about 200 GB of SSTables.

The JVM hang happens very fast. After the repair process starts, everything works; I checked memory, CPU, and I/O, and found no stress. After a random amount of time (while the streaming tasks are completing), the MemtableFlushWriter pending tasks increase very fast, and then GC fails. The JVM hangs and a heap dump is created. When the issue happens, CPU usage is low, and I cannot find any I/O latency in the AWS EBS disk metrics.
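
For completeness, this is how I watch the flush backlog while the repair runs (the grep pattern just assumes the usual 3.x thread pool names):

# watch the thread pool backlog every 5 seconds during repair;
# the Pending column of MemtableFlushWriter is the one that explodes for me
watch -n 5 'nodetool tpstats | grep -E "Pool Name|MemtableFlushWriter|MemtablePostFlush"'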

I checked the heap dump file. There are several big memtable objects for the table being repaired. The memtable objects are about 400 - 700 MB in size, and they were created within 20 seconds. In addition, I can see more than 12,000 memtables in total; about 6,000 of them are sstable_activity memtables.

At first, I suspected the memtable flush writer was the bottleneck, so I increased it to 4 threads and doubled the memory of the node. But it didn't help: during repair, the pending tasks increased quickly and the node hung again. I also decreased the repair token range to only one vnode, but it still failed.
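
In concrete terms, this is roughly what I changed and ran (the keyspace/table names and tokens below are placeholders, not my real ones; I took the token boundaries from nodetool describering):

# cassandra.yaml change (Cassandra 3.x setting name, 4 is the value I tried):
#   memtable_flush_writers: 4
# single-vnode (subrange) repair attempt:
nodetool repair -st <start_token> -et <end_token> my_keyspace my_table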

We can see log entries like this:

WARN [STREAM-IN-/10.0.113.12:7000] 2020-04-02 05:05:57,150 BigTableWriter.java:211 - Writing large partition ....

The SSTables being written are 300 - 500 MB; some big ones reach 2+ GB.
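
This is how I check the partition sizes of the repaired table (keyspace and table names are placeholders):

# partition size percentiles for the repaired table
nodetool tablehistograms my_keyspace my_table
# maximum/average compacted partition size is also reported here
nodetool tablestats my_keyspace.my_table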

I went through the Cassandra source code and found that streamed SSTables must go through the normal write path if the table has a materialized view. So I suspect the issue occurs in the COMPLETE stage of streaming.
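
To confirm that the table really has a view attached, I just list the views defined in the keyspace (keyspace name is a placeholder):

# materialized views and their base tables, from the 3.x schema tables
cqlsh -e "SELECT view_name, base_table_name FROM system_schema.views WHERE keyspace_name = 'my_keyspace';"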

After streaming, the receive callback loads the SSTables of the updated partitions and applies them as normal write mutations, which builds up memtables on the heap. In addition, it also invokes flush(), which creates extra memtables besides the ones for the repaired table. The memtable size exceeds the cleanup threshold, so a flush is triggered, but flushing cannot free enough memory, so it is triggered again and again; at the same time, each flush adds more memtables, too.
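
If I read the cassandra.yaml comments correctly, the numbers roughly line up; this is only my own back-of-the-envelope calculation, assuming the 3.x defaults:

# memtable_heap_space_in_mb  defaults to 1/4 of the heap              -> 16 GB / 4 = 4 GB
# memtable_cleanup_threshold defaults to 1/(memtable_flush_writers+1) -> 1/(4+1) = 0.2
# so a flush should be triggered once memtables hold about 0.2 * 4 GB = ~800 MB,
# which is the same order as the 400 - 700 MB memtables in my heap dump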

Has anyone met the same issue? If my conclusion is correct, how can I fix it?

Upvotes: 1

Views: 452

Answers (1)

Alex Ott

Reputation: 87154

Repair in Cassandra doesn't use memtables - it uses the same streaming mechanism that is used for bootstrapping nodes, etc. But if you have big partitions and they are damaged, then Cassandra will need to send them, and on the receiver side it will need to build auxiliary structures, etc. You can find more information on possible problems with repair in the following blog post.

One of the possible solutions is to use range repair, so you repair only specific portions of the token ring at a time. But doing this manually is a tedious task, so it's better to use a tool like Cassandra Reaper to automate the process.
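
For example, a subrange repair driven by hand looks roughly like this (keyspace name and tokens are placeholders; Reaper essentially automates iterating over these subranges for you):

# list the token ranges the node is responsible for
nodetool describering my_keyspace
# then repair one subrange at a time
nodetool repair --start-token <start_token> --end-token <end_token> my_keyspace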

Upvotes: 1
