Vinod Jayachandran

Reputation: 3898

Cassandra throws OutOfMemory

In our test environment, we have a single-node Cassandra cluster with RF=1 for all keyspaces.

The JVM arguments of interest are listed below:

-XX:+CMSClassUnloadingEnabled -XX:+UseThreadPriorities -XX:ThreadPriorityPolicy=42 -Xms2G -Xmx2G -Xmn1G -XX:+HeapDumpOnOutOfMemoryError -Xss256k -XX:StringTableSize=1000003 -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:SurvivorRatio=8
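For context: in Cassandra 2.1 these sizes normally come from conf/cassandra-env.sh rather than being passed by hand. A sketch of the equivalent lines for our current values:

MAX_HEAP_SIZE="2G"
HEAP_NEWSIZE="1G"

(cassandra-env.sh derives -Xms/-Xmx from MAX_HEAP_SIZE and -Xmn from HEAP_NEWSIZE.)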

We noticed full GCs happening frequently, and Cassandra becoming unresponsive during GC:

INFO  [Service Thread] 2016-12-29 15:52:40,901 GCInspector.java:252 - ParNew GC in 238ms.  CMS Old Gen: 782576192 -> 802826248; Par Survivor Space: 60068168 -> 32163264

INFO  [Service Thread] 2016-12-29 15:52:40,902 GCInspector.java:252 - ConcurrentMarkSweep GC in 1448ms.  CMS Old Gen: 802826248 -> 393377248; Par Eden Space: 859045888 -> 0; Par Survivor Space: 32163264 -> 0

We are getting a java.lang.OutOfMemoryError with the exception below:

ERROR [SharedPool-Worker-5] 2017-01-26 09:23:13,694 JVMStabilityInspector.java:94 - JVM state determined to be unstable.  Exiting forcefully due to:
java.lang.OutOfMemoryError: Java heap space
        at java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:57) ~[na:1.7.0_80]
        at java.nio.ByteBuffer.allocate(ByteBuffer.java:331) ~[na:1.7.0_80]
        at org.apache.cassandra.utils.memory.SlabAllocator.getRegion(SlabAllocator.java:137) ~[apache-cassandra-2.1.8.jar:2.1.8]
        at org.apache.cassandra.utils.memory.SlabAllocator.allocate(SlabAllocator.java:97) ~[apache-cassandra-2.1.8.jar:2.1.8]
        at org.apache.cassandra.utils.memory.ContextAllocator.allocate(ContextAllocator.java:57) ~[apache-cassandra-2.1.8.jar:2.1.8]
        at org.apache.cassandra.utils.memory.ContextAllocator.clone(ContextAllocator.java:47) ~[apache-cassandra-2.1.8.jar:2.1.8]
        at org.apache.cassandra.utils.memory.MemtableBufferAllocator.clone(MemtableBufferAllocator.java:61) ~[apache-cassandra-2.1.8.jar:2.1.8]
        at org.apache.cassandra.db.Memtable.put(Memtable.java:192) ~[apache-cassandra-2.1.8.jar:2.1.8]
        at org.apache.cassandra.db.ColumnFamilyStore.apply(ColumnFamilyStore.java:1237) ~[apache-cassandra-2.1.8.jar:2.1.8]
        at org.apache.cassandra.db.Keyspace.apply(Keyspace.java:400) ~[apache-cassandra-2.1.8.jar:2.1.8]
        at org.apache.cassandra.db.Keyspace.apply(Keyspace.java:363) ~[apache-cassandra-2.1.8.jar:2.1.8]
        at org.apache.cassandra.db.Mutation.apply(Mutation.java:214) ~[apache-cassandra-2.1.8.jar:2.1.8]
        at org.apache.cassandra.service.StorageProxy$7.runMayThrow(StorageProxy.java:1033) ~[apache-cassandra-2.1.8.jar:2.1.8]
        at org.apache.cassandra.service.StorageProxy$LocalMutationRunnable.run(StorageProxy.java:2224) ~[apache-cassandra-2.1.8.jar:2.1.8]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) ~[na:1.7.0_80]
        at org.apache.cassandra.concurrent.AbstractTracingAwareExecutorService$FutureTask.run(AbstractTracingAwareExecutorService.java:164) ~[apache-cassandra-2.1.8.jar:2.1.8]
        at org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:105) [apache-cassandra-2.1.8.jar:2.1.8]
        at java.lang.Thread.run(Thread.java:745) [na:1.7.0_80]

We were able to restore Cassandra after executing nodetool repair.

nodetool status

Datacenter: DC1

Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address     Load       Tokens  Owns    Host ID                               Rack
UN  10.3.211.3  5.74 GB    256     ?       32251391-5eee-4891-996d-30fb225116a1  RAC1

Note: Non-system keyspaces don't have the same replication settings, effective ownership information is meaningless

nodetool info

ID                     : 32251391-5eee-4891-996d-30fb225116a1
Gossip active          : true
Thrift active          : true
Native Transport active: true
Load                   : 5.74 GB
Generation No          : 1485526088
Uptime (seconds)       : 330651
Heap Memory (MB)       : 812.72 / 1945.63
Off Heap Memory (MB)   : 7.63
Data Center            : DC1
Rack                   : RAC1
Exceptions             : 0
Key Cache              : entries 68, size 6.61 KB, capacity 97 MB, 1158 hits, 1276 requests, 0.908 recent hit rate, 14400 save period in seconds
Row Cache              : entries 0, size 0 bytes, capacity 0 bytes, 0 hits, 0 requests, NaN recent hit rate, 0 save period in seconds
Counter Cache          : entries 0, size 0 bytes, capacity 48 MB, 0 hits, 0 requests, NaN recent hit rate, 7200 save period in seconds
Token                  : (invoke with -T/--tokens to see all 256 tokens)

From system.log, I see lots of "Compacting large partition" warnings:

WARN  [CompactionExecutor:33463] 2016-12-24 05:42:29,550 SSTableWriter.java:240 - Compacting large partition mydb/Table_Name:2016-12-23 00:00+0530 (142735455 bytes)
WARN  [CompactionExecutor:33465] 2016-12-24 05:47:57,343 SSTableWriter.java:240 - Compacting large partition mydb/Table_Name_2:22:0c2e6c00-a5a3-11e6-a05e-1f69f32db21c (162203393 bytes)
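Partition sizes can also be checked outside of compaction with nodetool; a sketch using the table name from the warning above:

nodetool cfstats mydb.Table_Name        # reports "Compacted partition maximum bytes"
nodetool cfhistograms mydb Table_Name   # shows the partition size distribution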

For tombstone settings, I see the following node configuration in system.log:

[main] 2016-12-28 18:23:06,534 YamlConfigurationLoader.java:135 - Node configuration:[authenticator=PasswordAuthenticator; authorizer=CassandraAuthorizer; auto_snapshot=true; batch_size_warn_threshold_in_kb=5; batchlog_replay_throttle_in_kb=1024; cas_contention_timeout_in_ms=1000; client_encryption_options=; cluster_name=bankbazaar; column_index_size_in_kb=64; commit_failure_policy=ignore; commitlog_directory=/var/cassandra/log/commitlog; commitlog_segment_size_in_mb=32; commitlog_sync=periodic; commitlog_sync_period_in_ms=10000; compaction_throughput_mb_per_sec=16; concurrent_counter_writes=32; concurrent_reads=32; concurrent_writes=32; counter_cache_save_period=7200; counter_cache_size_in_mb=null; counter_write_request_timeout_in_ms=15000; cross_node_timeout=false; data_file_directories=[/cryptfs/sdb/cassandra/data, /cryptfs/sdc/cassandra/data, /cryptfs/sdd/cassandra/data]; disk_failure_policy=best_effort; dynamic_snitch_badness_threshold=0.1; dynamic_snitch_reset_interval_in_ms=600000; dynamic_snitch_update_interval_in_ms=100; endpoint_snitch=GossipingPropertyFileSnitch; hinted_handoff_enabled=true; hinted_handoff_throttle_in_kb=1024; incremental_backups=false; index_summary_capacity_in_mb=null; index_summary_resize_interval_in_minutes=60; inter_dc_tcp_nodelay=false; internode_compression=all; key_cache_save_period=14400; key_cache_size_in_mb=null; listen_address=127.0.0.1; max_hint_window_in_ms=10800000; max_hints_delivery_threads=2; memtable_allocation_type=heap_buffers; native_transport_port=9042; num_tokens=256; partitioner=org.apache.cassandra.dht.Murmur3Partitioner; permissions_validity_in_ms=2000; range_request_timeout_in_ms=20000; read_request_timeout_in_ms=10000; request_scheduler=org.apache.cassandra.scheduler.NoScheduler; request_timeout_in_ms=20000; row_cache_save_period=0; row_cache_size_in_mb=0; rpc_address=127.0.0.1; rpc_keepalive=true; rpc_port=9160; rpc_server_type=sync; saved_caches_directory=/var/cassandra/data/saved_caches; seed_provider=[{class_name=org.apache.cassandra.locator.SimpleSeedProvider, parameters=[{seeds=127.0.0.1}]}]; server_encryption_options=; snapshot_before_compaction=false; ssl_storage_port=9001; sstable_preemptive_open_interval_in_mb=50; start_native_transport=true; start_rpc=true; storage_port=9000; thrift_framed_transport_size_in_mb=15; tombstone_failure_threshold=100000; tombstone_warn_threshold=1000; trickle_fsync=false; trickle_fsync_interval_in_kb=10240; truncate_request_timeout_in_ms=60000; write_request_timeout_in_ms=5000]
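The relevant values in that dump are tombstone_warn_threshold=1000 and tombstone_failure_threshold=100000. Per-query tombstone counts can also be checked with CQL tracing; a sketch with a placeholder table and key:

cqlsh> TRACING ON;
cqlsh> SELECT * FROM mydb.table_name WHERE id = 42;

The resulting trace reports live vs. tombstoned cell counts for the read.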

nodetool tpstats

Pool Name                    Active   Pending      Completed   Blocked  All time blocked
CounterMutationStage              0         0              0         0                 0
ReadStage                        32      4061       50469243         0                 0
RequestResponseStage              0         0              0         0                 0
MutationStage                    32        22       27665114         0                 0
ReadRepairStage                   0         0              0         0                 0
GossipStage                       0         0              0         0                 0
CacheCleanupExecutor              0         0              0         0                 0
AntiEntropyStage                  0         0              0         0                 0
MigrationStage                    0         0              0         0                 0
Sampler                           0         0              0         0                 0
ValidationExecutor                0         0              0         0                 0
CommitLogArchiver                 0         0              0         0                 0
MiscStage                         0         0              0         0                 0
MemtableFlushWriter               0         0           7769         0                 0
MemtableReclaimMemory             1        57          13433         0                 0
PendingRangeCalculator            0         0              1         0                 0
MemtablePostFlush                 0         0           9279         0                 0
CompactionExecutor                3        47         169022         0                 0
InternalResponseStage             0         0              0         0                 0
HintedHandoff                     0         1            148         0                 0

Is there any YAML or other configuration we can use to avoid these large partition compactions?

What is the correct compaction strategy to use? Can an OutOfMemoryError happen because of a wrong compaction strategy?

In one of the keyspaces, each row is written once and read multiple times.

Another keyspace holds time-series data: insert-only, with multiple reads.
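If it comes down to changing the compaction strategy, would something like this be reasonable? The table names below are placeholders: LeveledCompactionStrategy for the write-once/read-many keyspace, and DateTieredCompactionStrategy (available in 2.1) for the time-series one.

ALTER TABLE mydb.read_mostly_table
  WITH compaction = {'class': 'LeveledCompactionStrategy'};

ALTER TABLE mydb.timeseries_table
  WITH compaction = {'class': 'DateTieredCompactionStrategy'};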

Upvotes: 0

Views: 2892

Answers (1)

MarcintheCloud

Reputation: 1653

Seeing this: Heap Memory (MB): 812.72 / 1945.63 tells me that your one machine is probably underpowered. There's a good chance it's not able to keep up with GC.

While in this case I think the problem is mostly undersizing, access patterns, data model, and payload size can also affect GC, so if you'd like to update your post with that information, I can update my answer to reflect it.

EDIT to reflect new information

Thanks for adding the additional information. Based on what you posted, there are two immediate things I notice that can cause your heap to blow up:

Large partitions:

It looks like compaction had to process two partitions that exceeded 100 MB (roughly 140 MB and 160 MB respectively). Normally that would still be OK (not great), but because you're running on underpowered hardware with such a small heap, that's quite a lot.

The thing about compaction

It uses a healthy mix of resources when it runs. That's business as usual, so it's something you should test and plan for. In this case, I'm certain that compaction is working harder because of the large partitions, and it's consuming CPU (which GC also needs), heap, and IO.
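You can watch that load directly while it happens, for example:

nodetool compactionstats      # in-flight compactions and bytes remaining
nodetool compactionhistory    # recently completed compactions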

This brings me to another concern:

Pool Name                    Active   Pending      Completed   Blocked  All time blocked
CounterMutationStage              0         0              0         0                 0
ReadStage                        32      4061       50469243         0                 0

This is usually a sign that you need to scale up and/or scale out. In your case, you might want to do both. You can exhaust a single, underpowered node pretty quickly with an unoptimized data model. You also don't get to experience the nuances of a distributed system when testing in a single-node environment.

So the TL;DR:

For a read-heavy workload (which this seems to be), you'll need a larger heap. For overall sanity and cluster health, you'll also need to revisit your data model to make sure the partitioning logic is sound. If you're not sure how or why to do either, I suggest spending some time here: https://academy.datastax.com/courses
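For the time-series keyspace in particular, the usual fix for oversized partitions is adding a time bucket to the partition key so no single partition grows without bound. A sketch with made-up names:

CREATE TABLE mydb.events (
    source_id text,
    day       text,        -- e.g. '2016-12-23'; bounds each partition to one day
    ts        timestamp,
    payload   blob,
    PRIMARY KEY ((source_id, day), ts)
);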

Upvotes: 1
