Reputation: 322
We are using a Cassandra (3.10-1) cluster with 5 nodes - each with 8 cores and 23 GiB of memory, all in the same DC. Replication factor: 2. Consistency level: 2.
Lately, during the scheduled repair that runs about once a week, there are many timeouts on queries from one node - always the same node across different repairs. Looking at its logs:
debug.log shows several errors when building the Merkle tree for repair.
DEBUG [MemtablePostFlush:7354] 2018-12-30 23:52:08,314 ColumnFamilyStore.java:954 - forceFlush requested but everything is clean in user_device_validation_configuration
ERROR [ValidationExecutor:973] 2018-12-30 23:52:08,315 Validator.java:268 - Failed creating a merkle tree for [repair #b3887f60-0c8d-11e9-b894-754001cf0917 on keyspace1/user_device_validation_configuration, [(-8096630393642361664,-8073407512196669022]]], /10.52.5.42 (see log for details)
DEBUG [AntiEntropyStage:1] 2018-12-30 23:52:08,318 RepairMessageVerbHandler.java:114 - Validating ValidationRequest{gcBefore=1545349928} org.apache.cassandra.repair.messages.ValidationRequest@5c1c2b28
DEBUG [ValidationExecutor:973] 2018-12-30 23:52:08,319 StorageService.java:3313 - Forcing flush on keyspace keyspace1, CF raw_sensors
DEBUG [MemtablePostFlush:7354] 2018-12-30 23:52:08,319 ColumnFamilyStore.java:954 - forceFlush requested but everything is clean in raw_sensors
ERROR [ValidationExecutor:973] 2018-12-30 23:52:08,319 Validator.java:268 - Failed creating a merkle tree for [repair #b3887f60-0c8d-11e9-b894-754001cf0917 on keyspace1/raw_sensors, [(-8096630393642361664,-8073407512196669022]]], /10.52.5.42 (see log for details)
DEBUG [AntiEntropyStage:1] 2018-12-30 23:52:08,321 RepairMessageVerbHandler.java:114 - Validating ValidationRequest{gcBefore=1545349928} org.apache.cassandra.repair.messages.ValidationRequest@5c1c2b28
DEBUG [AntiEntropyStage:1] 2018-12-30 23:52:08,321 RepairMessageVerbHandler.java:142 - Got anticompaction request AnticompactionRequest{parentRepairSession=b387e320-0c8d-11e9-b894-754001cf0917} org.apache.cassandra.repair.messages.AnticompactionRequest@d4b7ed7b
ERROR [AntiEntropyStage:1] 2018-12-30 23:52:08,322 RepairMessageVerbHandler.java:168 - Got error, removing parent repair session
ERROR [AntiEntropyStage:1] 2018-12-30 23:52:08,322 CassandraDaemon.java:229 - Exception in thread Thread[AntiEntropyStage:1,5,main]
java.lang.RuntimeException: java.lang.RuntimeException: Parent repair session with id = b387e320-0c8d-11e9-b894-754001cf0917 has failed.
at org.apache.cassandra.repair.RepairMessageVerbHandler.doVerb(RepairMessageVerbHandler.java:171) ~[apache-cassandra-3.10.jar:3.10]
at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:66) ~[apache-cassandra-3.10.jar:3.10]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[na:1.8.0_131]
at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[na:1.8.0_131]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) ~[na:1.8.0_131]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_131]
at org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:79) [apache-cassandra-3.10.jar:3.10]
at java.lang.Thread.run(Thread.java:748) ~[na:1.8.0_131]
Caused by: java.lang.RuntimeException: Parent repair session with id = b387e320-0c8d-11e9-b894-754001cf0917 has failed.
at org.apache.cassandra.service.ActiveRepairService.getParentRepairSession(ActiveRepairService.java:400) ~[apache-cassandra-3.10.jar:3.10]
at org.apache.cassandra.service.ActiveRepairService.doAntiCompaction(ActiveRepairService.java:435) ~[apache-cassandra-3.10.jar:3.10]
at org.apache.cassandra.repair.RepairMessageVerbHandler.doVerb(RepairMessageVerbHandler.java:143) ~[apache-cassandra-3.10.jar:3.10]
... 7 common frames omitted
DEBUG [ValidationExecutor:973] 2018-12-30 23:52:08,323 StorageService.java:3313 - Forcing flush on keyspace keyspace1, CF mouse_interactions
DEBUG [MemtablePostFlush:7354] 2018-12-30 23:52:08,323 ColumnFamilyStore.java:954 - forceFlush requested but everything is clean in mouse_interactions
ERROR [ValidationExecutor:973] 2018-12-30 23:52:08,327 Validator.java:268 - Failed creating a merkle tree for [repair #b3887f60-0c8d-11e9-b894-754001cf0917 on keyspace1/mouse_interactions, [(-8096630393642361664,-8073407512196669022]]], /10.52.5.42 (see log for details)
DEBUG [GossipStage:1] 2018-12-30 23:52:10,643 FailureDetector.java:457 - Ignoring interval time of 2000164738 for /10.52.3.47
DEBUG [GossipStage:1] 2018-12-30 23:52:13,643 FailureDetector.java:457 - Ignoring interval time of 2000239040 for /10.52.3.47
DEBUG [ReadRepairStage:407] 2018-12-30 23:52:15,133 ReadCallback.java:242 - Digest mismatch:
org.apache.cassandra.service.DigestMismatchException: Mismatch for key DecoratedKey(7486012912397474412, 000467657474000020376337363933643363613837616231643531633936396564616234636363613400) (a0e45fcd73255bcd93a63b15d41e0843 vs 7dff1a87a755cf62150befc8352f59e8)
at org.apache.cassandra.service.DigestResolver.compareResponses(DigestResolver.java:92) ~[apache-cassandra-3.10.jar:3.10]
at org.apache.cassandra.service.ReadCallback$AsyncRepairRunner.run(ReadCallback.java:233) ~[apache-cassandra-3.10.jar:3.10]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_131]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_131]
at org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:79) [apache-cassandra-3.10.jar:3.10]
at java.lang.Thread.run(Thread.java:748) ~[na:1.8.0_131]
DEBUG [GossipStage:1] 2018-12-30 23:52:26,639 FailureDetector.java:457 - Ignoring interval time of 2000385682 for /10.52.3.47
After a few hours in the GC logs, I noticed that GC is being invoked roughly every 20 seconds and stops the application for more than 10 seconds each time:
2018-12-31T06:24:57.450+0000: 1184437.292: Total time for which application threads were stopped: 18.0318658 seconds, Stopping threads took: 0.0005000 seconds
2018-12-31T06:24:57.483+0000: 1184437.325: Total time for which application threads were stopped: 0.0053815 seconds, Stopping threads took: 0.0007325 seconds
2018-12-31T06:24:57.565+0000: 1184437.407: Total time for which application threads were stopped: 0.0118127 seconds, Stopping threads took: 0.0057652 seconds
2018-12-31T06:24:57.604+0000: 1184437.446: Total time for which application threads were stopped: 0.0064909 seconds, Stopping threads took: 0.0023037 seconds
2018-12-31T06:24:57.701+0000: 1184437.543: Total time for which application threads were stopped: 0.0066540 seconds, Stopping threads took: 0.0031299 seconds
{Heap before GC invocations=1377084 (full 108682):
par new generation total 943744K, used 943711K [0x00000005c0000000, 0x0000000600000000, 0x0000000600000000)
eden space 838912K, 100% used [0x00000005c0000000, 0x00000005f3340000, 0x00000005f3340000)
from space 104832K, 99% used [0x00000005f99a0000, 0x00000005ffff7ce0, 0x0000000600000000)
to space 104832K, 0% used [0x00000005f3340000, 0x00000005f3340000, 0x00000005f99a0000)
concurrent mark-sweep generation total 7340032K, used 7340032K [0x0000000600000000, 0x00000007c0000000, 0x00000007c0000000)
Metaspace used 71629K, capacity 73844K, committed 75000K, reserved 1116160K
class space used 8521K, capacity 8909K, committed 9176K, reserved 1048576K
2018-12-31T06:24:58.029+0000: 1184437.870: [Full GC (Allocation Failure) 2018-12-31T06:24:58.029+0000: 1184437.871: [CMS: 7340032K->7340031K(7340032K), 15.2051822 secs] 8283743K->7443230K(8283776K), [Metaspace: 71629K->71629K(1116160K)], 15.2055451 secs] [Times: user=13.94 sys=1.28, real=15.20 secs]
Heap after GC invocations=1377085 (full 108683):
par new generation total 943744K, used 103198K [0x00000005c0000000, 0x0000000600000000, 0x0000000600000000)
eden space 838912K, 12% used [0x00000005c0000000, 0x00000005c64c7950, 0x00000005f3340000)
from space 104832K, 0% used [0x00000005f99a0000, 0x00000005f99a0000, 0x0000000600000000)
to space 104832K, 0% used [0x00000005f3340000, 0x00000005f3340000, 0x00000005f99a0000)
concurrent mark-sweep generation total 7340032K, used 7340031K [0x0000000600000000, 0x00000007c0000000, 0x00000007c0000000)
Metaspace used 71629K, capacity 73844K, committed 75000K, reserved 1116160K
class space used 8521K, capacity 8909K, committed 9176K, reserved 1048576K
}
2018-12-31T06:25:13.235+0000: 1184453.076: Total time for which application threads were stopped: 15.2103050 seconds, Stopping threads took: 0.0004204 seconds
2018-12-31T06:25:13.243+0000: 1184453.085: Total time for which application threads were stopped: 0.0047592 seconds, Stopping threads took: 0.0008416 seconds
2018-12-31T06:25:13.272+0000: 1184453.114: Total time for which application threads were stopped: 0.0085538 seconds, Stopping threads took: 0.0046376 seconds
2018-12-31T06:25:13.298+0000: 1184453.140: [GC (CMS Initial Mark) [1 CMS-initial-mark: 7340031K(7340032K)] 7536074K(8283776K), 0.0650538 secs] [Times: user=0.12 sys=0.01, real=0.06 secs]
2018-12-31T06:25:13.364+0000: 1184453.206: Total time for which application threads were stopped: 0.0728487 seconds, Stopping threads took: 0.0039520 seconds
2018-12-31T06:25:13.364+0000: 1184453.206: [CMS-concurrent-mark-start]
{Heap before GC invocations=1377085 (full 108684):
par new generation total 943744K, used 943215K [0x00000005c0000000, 0x0000000600000000, 0x0000000600000000)
eden space 838912K, 100% used [0x00000005c0000000, 0x00000005f3340000, 0x00000005f3340000)
from space 104832K, 99% used [0x00000005f99a0000, 0x00000005fff7bd98, 0x0000000600000000)
to space 104832K, 0% used [0x00000005f3340000, 0x00000005f3340000, 0x00000005f99a0000)
concurrent mark-sweep generation total 7340032K, used 7340031K [0x0000000600000000, 0x00000007c0000000, 0x00000007c0000000)
Metaspace used 71631K, capacity 73844K, committed 75000K, reserved 1116160K
class space used 8521K, capacity 8909K, committed 9176K, reserved 1048576K
2018-12-31T06:25:13.821+0000: 1184453.662: [Full GC (Allocation Failure) 2018-12-31T06:25:13.821+0000: 1184453.663: [CMS2018-12-31T06:25:16.960+0000: 1184456.802: [CMS-concurrent-mark: 3.592/3.596 secs] [Times: user=12.47 sys=0.38, real=3.60 secs]
So I started checking the data spread in the cluster - we are using num_tokens: 32, and the data seems to be balanced at roughly 40% per node.
UN 10.52.2.94 672.64 GiB 32 ? ad3d1365-bbb7-4229-b586-40667ec22b41 rack1
UN 10.52.3.47 640.91 GiB 32 ? cdba952b-9685-4769-aaf4-22e538a5c37f rack1
UN 10.52.1.57 719.34 GiB 32 ? 13bb7573-eb30-489b-80c4-6e5a7c8e5f5e rack1
UN 10.52.5.42 743.04 GiB 32 ? c9e892c6-9281-4a49-b4c4-a147a387c3de rack1
UN 10.52.4.43 691.1 GiB 32 ? 53e3724e-f5a9-46b1-b330-7bb542f15f90 rack1
After checking the logs on the other nodes, I can't find any reason for those timeouts on that specific node. Any thoughts or ideas as to what causes this to happen on the same node again and again?
Upvotes: 3
Views: 2254
Reputation: 57748
It's really odd to see this on only one node. Double-check that the configs are the same. Otherwise, you might be writing/querying a large partition which that node is primarily responsible for.
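As a rough way to check for large partitions, something like the following can help (the table name here is just taken from the logs above - substitute your own tables). tablestats reports "Compacted partition maximum bytes", and tablehistograms shows partition-size percentiles:

nodetool tablestats keyspace1.raw_sensors
nodetool tablehistograms keyspace1 raw_sensors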
Replication factor: 2. Consistency level: 2.
In general, repairs can cause nodes to have trouble serving requests, as building Merkle Trees and streaming data are quite resource-intensive. I see two problems here:
1. The long GC pauses on that node.
2. A replication factor of 2 combined with a consistency level of 2.
Starting with #2, when you have RF=2 and you're requiring 2 replicas to respond to your queries, you are essentially querying at CONSISTENCY_ALL. Therefore, should a node become overwhelmed and short on resources, your queries will be unable to complete. If it were me, I would increase the RF to 3 and run a repair (assuming the nodes have adequate storage). The other option would be to query at a consistency level of 1 (ONE), which is probably what you should be doing with RF=2 anyway.
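As a minimal sketch of both options - assuming keyspace1 uses SimpleStrategy; if you're on NetworkTopologyStrategy, adjust the replication map with your DC name instead:

ALTER KEYSPACE keyspace1
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};
-- then run a full repair (shell, on each node): nodetool repair --full keyspace1

-- Or, keeping RF=2, read/write at ONE instead (shown in cqlsh; in the application,
-- set the consistency level on the driver):
CONSISTENCY ONE;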
Also, when diagnosing query problems, it usually helps to see the actual query being run (which you have not provided). With Cassandra, more often than not, query issues are the result of queries which don't fit the designed table.
As for the long GC pauses, that's a trickier problem to solve. There's an old Cassandra JIRA (CASSANDRA-8150) ticket which talks about optimal settings for the CMS Garbage Collector. You should give that a read.
What is your MAX_HEAP_SIZE set to? I see your new generation is less than 1GB, which is going to be too small. With 23GB of RAM on each node, I'd start with the following settings for using CMS GC:
- Max Heap Size (Xmx) of 8GB-12GB (you want to leave about 50% of RAM for the OS/page-cache).
- Initial Heap Size (Xms) equal to Max Heap Size, to avoid the performance hit of resizing the heap.
- Heap New Size (Xmn) of 40-50% of your Max Heap Size (so somewhere between 3GB-6GB). You want plenty of room available for the new generation.
- MaxTenuringThreshold of 6 to 8, as you want objects to be passed around the new gen survivor spaces until they die, in lieu of being promoted to the old gen. By default, I think this is set to 1, which means objects will get promoted too quickly.

Basically, new-to-old/old-to-permanent promotion is where the long pauses happen. Ideally, you'd like all objects on your heap to be created, live, and die in the new gen, without being promoted.
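Purely as an illustration (not a drop-in config), with 23GB of RAM those suggestions might translate into something like this in conf/jvm.options (or the equivalent MAX_HEAP_SIZE/HEAP_NEWSIZE variables in cassandra-env.sh) - the exact sizes are assumptions you'd want to validate under your own workload:

# initial heap = max heap, to avoid resize pauses; ~50% of RAM left for OS/page cache
-Xms10G
-Xmx10G
# new generation at roughly 40% of the heap
-Xmn4G
# let objects age through the survivor spaces instead of promoting early
-XX:MaxTenuringThreshold=6
# CMS collectors (already enabled by default in the 3.x jvm.options)
-XX:+UseParNewGC
-XX:+UseConcMarkSweepGC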
Otherwise, it might be worth your while to try using G1GC. For G1, I'd go with a Max/Initial Heap of about 50% of RAM, a MaxGCPauseMillis of 400-500ms, and don't set Heap New Size at all (G1 computes that).
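Again just as a sketch, the G1 variant in conf/jvm.options might look like the following (the heap size is an assumption based on ~50% of your 23GB of RAM; remove the CMS flags if you switch):

-XX:+UseG1GC
# pause target in the 400-500ms range
-XX:MaxGCPauseMillis=500
# ~50% of RAM; do NOT set -Xmn with G1, it sizes the young gen itself
-Xms12G
-Xmx12G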
Upvotes: 1