Filipe Teixeira

Reputation: 3565

Can Cassandra Compaction Lead To Timeouts?

I am using Titan 1.0.0 to bulk load a large amount of data (35M vertices) as quickly as possible into a single-node Cassandra instance. During this process a cleanup procedure triggers periodically and mutates some properties on x vertices, where 10000 < x <= 500000. I ensure that each transaction affects exactly 100 vertices.
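
Roughly, the batching looks like the sketch below (Titan 1.0.0 / TinkerPop 3 API; the class name, the id list and the "cleaned" property are placeholders for the real cleanup logic):

    import java.util.Iterator;

    import org.apache.tinkerpop.gremlin.structure.Vertex;

    import com.thinkaurelius.titan.core.TitanFactory;
    import com.thinkaurelius.titan.core.TitanGraph;

    public class CleanupBatcher {

        private static final int BATCH_SIZE = 100;

        // Flag each affected vertex and commit the thread-local transaction
        // after every BATCH_SIZE mutations so that no single commit grows too large.
        public static void cleanup(TitanGraph graph, Iterable<Object> vertexIds) {
            int mutated = 0;
            for (Object id : vertexIds) {
                Iterator<Vertex> found = graph.vertices(id);
                if (found.hasNext()) {
                    found.next().property("cleaned", true);
                    if (++mutated % BATCH_SIZE == 0) {
                        graph.tx().commit();   // flush this batch of 100 mutations to Cassandra
                    }
                }
            }
            graph.tx().commit();               // commit any final partial batch
        }

        public static void main(String[] args) throws Exception {
            TitanGraph graph = TitanFactory.open("conf/titan-cassandra.properties");
            // cleanup(graph, idsFromCleanupProcedure);   // ids supplied by the cleanup procedure
            graph.close();
        }
    }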

Initially this process works, but once my graph develops some super nodes I start to see the following exception:

com.thinkaurelius.titan.diskstorage.TemporaryBackendException: 
Caused by: com.netflix.astyanax.connectionpool.exceptions.OperationTimeoutException: OperationTimeoutException: [host=172.18.02(172.18.0.2):9160, latency=4031(4031), attempts=1]TimedOutException(acknowledged_by:0, acknowledged_by_batchlog:true)
    at com.netflix.astyanax.thrift.ThriftConverter.ToConnectionPoolException(ThriftConverter.java:171) ~[astyanax-thrift-3.8.0.jar!/:3.8.0]
    at com.netflix.astyanax.thrift.AbstractOperationImpl.execute(AbstractOperationImpl.java:65) ~[astyanax-thrift-3.8.0.jar!/:3.8.0]
    at com.netflix.astyanax.thrift.AbstractOperationImpl.execute(AbstractOperationImpl.java:28) ~[astyanax-thrift-3.8.0.jar!/:3.8.0]
    at com.netflix.astyanax.thrift.ThriftSyncConnectionFactoryImpl$ThriftConnection.execute(ThriftSyncConnectionFactoryImpl.java:153) ~[astyanax-thrift-3.8.0.jar!/:3.8.0]
    at com.netflix.astyanax.connectionpool.impl.AbstractExecuteWithFailoverImpl.tryOperation(AbstractExecuteWithFailoverImpl.java:119) ~[astyanax-core-3.8.0.jar!/:3.8.0]
    at com.netflix.astyanax.connectionpool.impl.AbstractHostPartitionConnectionPool.executeWithFailover(AbstractHostPartitionConnectionPool.java:352) ~[astyanax-core-3.8.0.jar!/:3.8.0]
    at com.netflix.astyanax.thrift.ThriftKeyspaceImpl.executeOperation(ThriftKeyspaceImpl.java:517) ~[astyanax-thrift-3.8.0.jar!/:3.8.0]
    at com.netflix.astyanax.thrift.ThriftKeyspaceImpl.access$000(ThriftKeyspaceImpl.java:93) ~[astyanax-thrift-3.8.0.jar!/:3.8.0]
    at com.netflix.astyanax.thrift.ThriftKeyspaceImpl$1.execute(ThriftKeyspaceImpl.java:137) ~[astyanax-thrift-3.8.0.jar!/:3.8.0]
    at com.thinkaurelius.titan.diskstorage.cassandra.astyanax.AstyanaxStoreManager.mutateMany(AstyanaxStoreManager.java:389) ~[titan-cassandra-1.0.0.jar!/:na]
    ... 22 common frames omitted
Caused by: org.apache.cassandra.thrift.TimedOutException: null
    at org.apache.cassandra.thrift.Cassandra$atomic_batch_mutate_result$atomic_batch_mutate_resultStandardScheme.read(Cassandra.java:29624) ~[cassandra-thrift-2.1.9.jar!/:2.1.9]
    at org.apache.cassandra.thrift.Cassandra$atomic_batch_mutate_result$atomic_batch_mutate_resultStandardScheme.read(Cassandra.java:29592) ~[cassandra-thrift-2.1.9.jar!/:2.1.9]
    at org.apache.cassandra.thrift.Cassandra$atomic_batch_mutate_result.read(Cassandra.java:29526) ~[cassandra-thrift-2.1.9.jar!/:2.1.9]
    at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:78) ~[libthrift-0.9.2.jar!/:0.9.2]
    at org.apache.cassandra.thrift.Cassandra$Client.recv_atomic_batch_mutate(Cassandra.java:1108) ~[cassandra-thrift-2.1.9.jar!/:2.1.9]
    at org.apache.cassandra.thrift.Cassandra$Client.atomic_batch_mutate(Cassandra.java:1094) ~[cassandra-thrift-2.1.9.jar!/:2.1.9]
    at com.netflix.astyanax.thrift.ThriftKeyspaceImpl$1$1.internalExecute(ThriftKeyspaceImpl.java:147) ~[astyanax-thrift-3.8.0.jar!/:3.8.0]
    at com.netflix.astyanax.thrift.ThriftKeyspaceImpl$1$1.internalExecute(ThriftKeyspaceImpl.java:141) ~[astyanax-thrift-3.8.0.jar!/:3.8.0]
    at com.netflix.astyanax.thrift.AbstractOperationImpl.execute(AbstractOperationImpl.java:60) ~[astyanax-thrift-3.8.0.jar!/:3.8.0]
    ... 30 common frames omitted

I have noticed that when this happens, Cassandra is busy running large compaction jobs:

WARN  09:43:59 Compacting large partition test/edgestore:0000000000357200 (252188100 bytes)
WARN  09:47:03 Compacting large partition test/edgestore:6800000000365a80 (1417482764 bytes)
WARN  09:48:37 Compacting large partition test/edgestore:0000000000002480 (127497758 bytes)
WARN  09:51:58 Compacting large partition test/edgestore:6000000000376d00 (227606217 bytes)
WARN  09:54:35 Compacting large partition test/edgestore:d000000000002b00 (124082466 bytes)
WARN  09:58:24 Compacting large partition test/edgestore:6800000000354380 (172991088 bytes)

So the question is simple: can Cassandra compaction lead to the above timeouts, and if so, what is the best approach for handling it?

Upvotes: 0

Views: 1125

Answers (1)

doanduyhai

Reputation: 8812

Can Cassandra compaction lead to the above timeouts, and if so, what is the best approach for handling it?

Yes, definitely. The root cause may not be the compaction itself but rather saturated I/O bandwidth. Below is one possible chain of events:

  1. Heavy compaction
  2. Disk I/O cannot keep up
  3. Data stays longer in memory
  4. Data is promoted to the JVM old generation
  5. Stop-the-world garbage collections start to kick in
  6. The node is detected as down by other nodes

The first thing to check is to grep for the keyword "GC" in your /var/log/cassandra/system.log and to monitor the disk I/O and CPU I/O wait using the dstat tool.
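
For example, something like this (exact dstat flags may vary with your version):

    grep -i "GC" /var/log/cassandra/system.log
    dstat -tcd 5        # watch the "wai" column (CPU I/O wait) and the disk read/write throughput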

Also, how large is the configured JVM heap?
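
For reference, in Cassandra 2.1 the heap is set via MAX_HEAP_SIZE and HEAP_NEWSIZE in conf/cassandra-env.sh (it is sized automatically from system memory if these are left unset); the values below are purely illustrative:

    MAX_HEAP_SIZE="8G"
    HEAP_NEWSIZE="800M"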

Upvotes: 1
