Reputation: 3898
In our test environment, we have a single-node Cassandra cluster with RF=1 for all keyspaces.
The JVM arguments of interest are listed below:
-XX:+CMSClassUnloadingEnabled -XX:+UseThreadPriorities -XX:ThreadPriorityPolicy=42 -Xms2G -Xmx2G -Xmn1G -XX:+HeapDumpOnOutOfMemoryError -Xss256k -XX:StringTableSize=1000003 -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:SurvivorRatio=8
We noticed frequent full GCs, and Cassandra becomes unresponsive while they run.
INFO [Service Thread] 2016-12-29 15:52:40,901 GCInspector.java:252 - ParNew GC in 238ms. CMS Old Gen: 782576192 -> 802826248; Par Survivor Space: 60068168 -> 32163264
INFO [Service Thread] 2016-12-29 15:52:40,902 GCInspector.java:252 - ConcurrentMarkSweep GC in 1448ms. CMS Old Gen: 802826248 -> 393377248; Par Eden Space: 859045888 -> 0; Par Survivor Space: 32163264 -> 0
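Pauses like these can be summarized straight from system.log with a short script (a sketch; the regex assumes the GCInspector line format shown above):

```python
import re

# Matches GCInspector lines such as:
#   "GCInspector.java:252 - ParNew GC in 238ms. ..."
GC_LINE = re.compile(r"GCInspector\.java:\d+ - (\w+) GC in (\d+)ms")

def summarize_gc(lines, threshold_ms=500):
    """Group pause durations (ms) above threshold_ms by collector name."""
    pauses = {}
    for line in lines:
        m = GC_LINE.search(line)
        if m and int(m.group(2)) >= threshold_ms:
            pauses.setdefault(m.group(1), []).append(int(m.group(2)))
    return pauses

log = [
    "INFO [Service Thread] ... GCInspector.java:252 - ParNew GC in 238ms. ...",
    "INFO [Service Thread] ... GCInspector.java:252 - ConcurrentMarkSweep GC in 1448ms. ...",
]
print(summarize_gc(log))  # only the 1448 ms CMS pause crosses the 500 ms threshold
```

Running this over the full log makes it easy to see whether long CMS pauses cluster around compaction activity.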
We are getting java.lang.OutOfMemoryError with the stack trace below:
ERROR [SharedPool-Worker-5] 2017-01-26 09:23:13,694 JVMStabilityInspector.java:94 - JVM state determined to be unstable. Exiting forcefully due to:
java.lang.OutOfMemoryError: Java heap space
at java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:57) ~[na:1.7.0_80]
at java.nio.ByteBuffer.allocate(ByteBuffer.java:331) ~[na:1.7.0_80]
at org.apache.cassandra.utils.memory.SlabAllocator.getRegion(SlabAllocator.java:137) ~[apache-cassandra-2.1.8.jar:2.1.8]
at org.apache.cassandra.utils.memory.SlabAllocator.allocate(SlabAllocator.java:97) ~[apache-cassandra-2.1.8.jar:2.1.8]
at org.apache.cassandra.utils.memory.ContextAllocator.allocate(ContextAllocator.java:57) ~[apache-cassandra-2.1.8.jar:2.1.8]
at org.apache.cassandra.utils.memory.ContextAllocator.clone(ContextAllocator.java:47) ~[apache-cassandra-2.1.8.jar:2.1.8]
at org.apache.cassandra.utils.memory.MemtableBufferAllocator.clone(MemtableBufferAllocator.java:61) ~[apache-cassandra-2.1.8.jar:2.1.8]
at org.apache.cassandra.db.Memtable.put(Memtable.java:192) ~[apache-cassandra-2.1.8.jar:2.1.8]
at org.apache.cassandra.db.ColumnFamilyStore.apply(ColumnFamilyStore.java:1237) ~[apache-cassandra-2.1.8.jar:2.1.8]
at org.apache.cassandra.db.Keyspace.apply(Keyspace.java:400) ~[apache-cassandra-2.1.8.jar:2.1.8]
at org.apache.cassandra.db.Keyspace.apply(Keyspace.java:363) ~[apache-cassandra-2.1.8.jar:2.1.8]
at org.apache.cassandra.db.Mutation.apply(Mutation.java:214) ~[apache-cassandra-2.1.8.jar:2.1.8]
at org.apache.cassandra.service.StorageProxy$7.runMayThrow(StorageProxy.java:1033) ~[apache-cassandra-2.1.8.jar:2.1.8]
at org.apache.cassandra.service.StorageProxy$LocalMutationRunnable.run(StorageProxy.java:2224) ~[apache-cassandra-2.1.8.jar:2.1.8]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) ~[na:1.7.0_80]
at org.apache.cassandra.concurrent.AbstractTracingAwareExecutorService$FutureTask.run(AbstractTracingAwareExecutorService.java:164) ~[apache-cassandra-2.1.8.jar:2.1.8]
at org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:105) [apache-cassandra-2.1.8.jar:2.1.8]
at java.lang.Thread.run(Thread.java:745) [na:1.7.0_80]
We were able to restore Cassandra after running nodetool repair.
nodetool status
Datacenter: DC1
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns Host ID Rack
UN 10.3.211.3 5.74 GB 256 ? 32251391-5eee-4891-996d-30fb225116a1 RAC1
Note: Non-system keyspaces don't have the same replication settings, effective ownership information is meaningless
nodetool info
ID : 32251391-5eee-4891-996d-30fb225116a1
Gossip active : true
Thrift active : true
Native Transport active: true
Load : 5.74 GB
Generation No : 1485526088
Uptime (seconds) : 330651
Heap Memory (MB) : 812.72 / 1945.63
Off Heap Memory (MB) : 7.63
Data Center : DC1
Rack : RAC1
Exceptions : 0
Key Cache : entries 68, size 6.61 KB, capacity 97 MB, 1158 hits, 1276 requests, 0.908 recent hit rate, 14400 save period in seconds
Row Cache : entries 0, size 0 bytes, capacity 0 bytes, 0 hits, 0 requests, NaN recent hit rate, 0 save period in seconds
Counter Cache : entries 0, size 0 bytes, capacity 48 MB, 0 hits, 0 requests, NaN recent hit rate, 7200 save period in seconds
Token : (invoke with -T/--tokens to see all 256 tokens)
In system.log, I see lots of "Compacting large partition" warnings:
WARN [CompactionExecutor:33463] 2016-12-24 05:42:29,550 SSTableWriter.java:240 - Compacting large partition mydb/Table_Name:2016-12-23 00:00+0530 (142735455 bytes)
WARN [CompactionExecutor:33465] 2016-12-24 05:47:57,343 SSTableWriter.java:240 - Compacting large partition mydb/Table_Name_2:22:0c2e6c00-a5a3-11e6-a05e-1f69f32db21c (162203393 bytes)
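The offending partitions can be ranked by parsing those warnings (a sketch; assumes the exact SSTableWriter warning format shown above):

```python
import re

# Matches SSTableWriter warnings such as:
#   "Compacting large partition mydb/Table_Name:2016-12-23 00:00+0530 (142735455 bytes)"
WARN = re.compile(r"Compacting large partition (.+) \((\d+) bytes\)")

def large_partitions(lines):
    """Return [(partition, size_mb)] sorted largest first."""
    hits = []
    for line in lines:
        m = WARN.search(line)
        if m:
            hits.append((m.group(1), int(m.group(2)) / 1024 ** 2))
    return sorted(hits, key=lambda h: -h[1])

log = [
    "WARN ... SSTableWriter.java:240 - Compacting large partition mydb/Table_Name:2016-12-23 00:00+0530 (142735455 bytes)",
    "WARN ... SSTableWriter.java:240 - Compacting large partition mydb/Table_Name_2:22:0c2e6c00-a5a3-11e6-a05e-1f69f32db21c (162203393 bytes)",
]
for partition, mb in large_partitions(log):
    print(f"{mb:8.1f} MB  {partition}")
```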
Regarding tombstones, this is the node configuration I see in system.log (note tombstone_warn_threshold=1000 and tombstone_failure_threshold=100000):
[main] 2016-12-28 18:23:06,534 YamlConfigurationLoader.java:135 - Node configuration:[authenticator=PasswordAuthenticator; authorizer=CassandraAuthorizer; auto_snapshot=true; batch_size_warn_threshold_in_kb=5; batchlog_replay_throttle_in_kb=1024; cas_contention_timeout_in_ms=1000; client_encryption_options=; cluster_name=bankbazaar; column_index_size_in_kb=64; commit_failure_policy=ignore; commitlog_directory=/var/cassandra/log/commitlog; commitlog_segment_size_in_mb=32; commitlog_sync=periodic; commitlog_sync_period_in_ms=10000; compaction_throughput_mb_per_sec=16; concurrent_counter_writes=32; concurrent_reads=32; concurrent_writes=32; counter_cache_save_period=7200; counter_cache_size_in_mb=null; counter_write_request_timeout_in_ms=15000; cross_node_timeout=false; data_file_directories=[/cryptfs/sdb/cassandra/data, /cryptfs/sdc/cassandra/data, /cryptfs/sdd/cassandra/data]; disk_failure_policy=best_effort; dynamic_snitch_badness_threshold=0.1; dynamic_snitch_reset_interval_in_ms=600000; dynamic_snitch_update_interval_in_ms=100; endpoint_snitch=GossipingPropertyFileSnitch; hinted_handoff_enabled=true; hinted_handoff_throttle_in_kb=1024; incremental_backups=false; index_summary_capacity_in_mb=null; index_summary_resize_interval_in_minutes=60; inter_dc_tcp_nodelay=false; internode_compression=all; key_cache_save_period=14400; key_cache_size_in_mb=null; listen_address=127.0.0.1; max_hint_window_in_ms=10800000; max_hints_delivery_threads=2; memtable_allocation_type=heap_buffers; native_transport_port=9042; num_tokens=256; partitioner=org.apache.cassandra.dht.Murmur3Partitioner; permissions_validity_in_ms=2000; range_request_timeout_in_ms=20000; read_request_timeout_in_ms=10000; request_scheduler=org.apache.cassandra.scheduler.NoScheduler; request_timeout_in_ms=20000; row_cache_save_period=0; row_cache_size_in_mb=0; rpc_address=127.0.0.1; rpc_keepalive=true; rpc_port=9160; rpc_server_type=sync; saved_caches_directory=/var/cassandra/data/saved_caches; 
seed_provider=[{class_name=org.apache.cassandra.locator.SimpleSeedProvider, parameters=[{seeds=127.0.0.1}]}]; server_encryption_options=; snapshot_before_compaction=false; ssl_storage_port=9001; sstable_preemptive_open_interval_in_mb=50; start_native_transport=true; start_rpc=true; storage_port=9000; thrift_framed_transport_size_in_mb=15; tombstone_failure_threshold=100000; tombstone_warn_threshold=1000; trickle_fsync=false; trickle_fsync_interval_in_kb=10240; truncate_request_timeout_in_ms=60000; write_request_timeout_in_ms=5000]
nodetool tpstats
Pool Name Active Pending Completed Blocked All time blocked
CounterMutationStage 0 0 0 0 0
ReadStage 32 4061 50469243 0 0
RequestResponseStage 0 0 0 0 0
MutationStage 32 22 27665114 0 0
ReadRepairStage 0 0 0 0 0
GossipStage 0 0 0 0 0
CacheCleanupExecutor 0 0 0 0 0
AntiEntropyStage 0 0 0 0 0
MigrationStage 0 0 0 0 0
Sampler 0 0 0 0 0
ValidationExecutor 0 0 0 0 0
CommitLogArchiver 0 0 0 0 0
MiscStage 0 0 0 0 0
MemtableFlushWriter 0 0 7769 0 0
MemtableReclaimMemory 1 57 13433 0 0
PendingRangeCalculator 0 0 1 0 0
MemtablePostFlush 0 0 9279 0 0
CompactionExecutor 3 47 169022 0 0
InternalResponseStage 0 0 0 0 0
HintedHandoff 0 1 148 0 0
Is there any YAML or other configuration that can be used to avoid these large-partition compactions?
What is the correct compaction strategy to use? Can an OutOfMemoryError be caused by the wrong compaction strategy?
In one keyspace, each row is written once and read many times.
Another keyspace holds time-series data: insert-only, with multiple reads per row.
Upvotes: 0
Views: 2892
Reputation: 1653
Seeing this: Heap Memory (MB): 812.72 / 1945.63 tells me that your one machine is probably underpowered, and there's a good chance you're not able to keep up with GC.
While in this case I think the problem is mostly undersizing, access patterns, data model, and payload size can also affect GC, so if you'd like to update your post with that info, I can update my answer to reflect it.
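For reference, the heap is sized in conf/cassandra-env.sh; on a machine with more RAM you would raise values along these lines (example numbers only, not a recommendation for your exact hardware):

```shell
# conf/cassandra-env.sh -- example values only; size these to your hardware.
# Common CMS-era guidance: heap of roughly 1/4 of RAM, capped near 8 GB,
# with the new generation sized around 100 MB per CPU core.
MAX_HEAP_SIZE="8G"
HEAP_NEWSIZE="800M"
```

Leaving both unset lets Cassandra compute defaults from the machine's RAM and core count, which is usually the safer starting point.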
EDIT to reflect new information
Thanks for adding the additional information. Based on what you posted, there are two immediate things I notice that can blow out your heap:
Large Partition:
It looks like compaction had to process two partitions that exceeded 100 MB (roughly 140 MB and 160 MB respectively). Normally that would still be OK (not great), but because you're running on underpowered hardware with such a small heap, that's quite a lot.
The thing about compaction
Compaction uses a healthy mix of resources when it runs. It's business as usual, so it's something you should test and plan for. In this case, I'm certain compaction is working harder because of the large partitions, consuming CPU (which GC also needs), heap, and I/O.
This brings me to another concern:
Pool Name Active Pending Completed Blocked All time blocked
CounterMutationStage 0 0 0 0 0
ReadStage 32 4061 50469243 0 0
Thirty-two active reads with 4,061 pending is usually a sign that you need to scale up and/or scale out; in your case, you might want to do both. You can exhaust a single underpowered node pretty quickly with an unoptimized data model, and you also don't get to experience the nuances of a distributed system when you test in a single-node environment.
So the TL;DR:
For a read-heavy workload (which this seems to be), you'll need a larger heap. For overall sanity and cluster health, you'll need to revisit your data model to make sure the partitioning logic is sound. If you're not sure how or why to do either, I suggest spending some time here: https://academy.datastax.com/courses
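On the compaction-strategy question: the strategy is set per table, and the right choice depends on the workloads you described. As a sketch (hypothetical table names; verify on a test cluster before applying), DateTieredCompactionStrategy suits insert-only time-series data on 2.1 (later releases replace it with TimeWindowCompactionStrategy), while LeveledCompactionStrategy favors read-heavy, write-once tables at the cost of extra compaction I/O:

```sql
-- Hypothetical table names; sketch only, test before applying in production.

-- Insert-only time-series keyspace: DTCS groups data written in the same
-- time window, so old data isn't recompacted over and over.
ALTER TABLE mydb.timeseries_table
  WITH compaction = {'class': 'DateTieredCompactionStrategy'};

-- Write-once/read-many keyspace: LCS bounds the number of SSTables a read
-- touches, at the cost of more compaction I/O.
ALTER TABLE mydb.read_mostly_table
  WITH compaction = {'class': 'LeveledCompactionStrategy'};
```

Note that no compaction strategy fixes oversized partitions; those have to be addressed in the data model itself.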
Upvotes: 1