ftrujillo

Reputation: 1192

Cassandra NoHostAvailableException when deletes are executed with cqlsh

We have a cluster with 7 nodes and we use the DataStax Java driver to connect to it. The problem is that I am constantly getting NoHostAvailableException errors like this:

Caused by: com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: /172.31.7.243:9042 (com.datastax.driver.core.exceptions.DriverException: Timeout while trying to acquire available connection (you may want to increase the driver number of per-host connections)), /172.31.7.245:9042 (com.datastax.driver.core.exceptions.DriverException: Timeout while trying to acquire available connection (you may want to increase the driver number of per-host connections)), /172.31.7.246:9042 (com.datastax.driver.core.exceptions.DriverException: Timeout while trying to acquire available connection (you may want to increase the driver number of per-host connections)), /172.31.7.247:9042, /172.31.7.232:9042, /172.31.7.233:9042, /172.31.7.244:9042 [only showing errors of first 3 hosts, use getErrors() for more details])

All the nodes are up:

UN  172.31.7.244  152.21 GB  256     14.5%  58abea69-e7ba-4e57-9609-24f3673a7e58  RAC1
UN  172.31.7.245  168.4 GB   256     14.5%  bc11b4f0-cf96-4ca5-9a3e-33cc2b92a752  RAC1
UN  172.31.7.246  177.71 GB  256     13.7%  8dc7bb3d-38f7-49b9-b8db-a622cc80346c  RAC1
UN  172.31.7.247  158.57 GB  256     14.1%  94022081-a563-4042-81ab-75ffe4d13194  RAC1
UN  172.31.7.243  176.83 GB  256     14.6%  0dda3410-db58-42f2-9351-068bdf68f530  RAC1
UN  172.31.7.233  159 GB     256     13.6%  01e013fb-2f57-44fb-b3c5-fd89d705bfdd  RAC1
UN  172.31.7.232  166.05 GB  256     15.0%  4d009603-faa9-4add-b3a2-fe24ec16a7c1  RAC1

but two of them have a high CPU load, especially the .232 node, because I am running a lot of deletes using cqlsh on that node.

I know that deletes generate tombstones, but with 7 nodes in the cluster I do not think it is normal that none of the hosts are accessible.

Our configuration for the Java connection is:

com.datastax.driver.core.Cluster cluster = null;

// Get contact points
String[] contactPoints = this.environment.getRequiredProperty(CASSANDRA_CLUSTER_URL).split(",");

cluster = com.datastax.driver.core.Cluster.builder()
        .addContactPoints(contactPoints)
        .withCredentials(this.environment.getRequiredProperty(CASSANDRA_CLUSTER_USERNAME),
                this.environment.getRequiredProperty(CASSANDRA_CLUSTER_PASSWORD))
        .withQueryOptions(new QueryOptions()
                .setConsistencyLevel(ConsistencyLevel.QUORUM))
        .withLoadBalancingPolicy(new TokenAwarePolicy(new RoundRobinPolicy()))
        .withRetryPolicy(new LoggingRetryPolicy(DowngradingConsistencyRetryPolicy.INSTANCE))
        .withPort(Integer.parseInt(this.environment.getRequiredProperty(CASSANDRA_CLUSTER_PORT)))
        .build();

Metadata metadata = cluster.getMetadata();
for (Host host : metadata.getAllHosts()) {
    LOG.info("Datacenter: " + host.getDatacenter() + "; Host: " + host.getAddress() + "\n");
}

and the contact points are:

172.31.7.244,172.31.7.243,172.31.7.245,172.31.7.246,172.31.7.247
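
Since the error message says "you may want to increase the driver number of per-host connections", I was thinking of tuning the connection pool. A rough sketch of what I had in mind with the 2.x driver follows; the numbers are just placeholders to experiment with, and I think setPoolTimeoutMillis needs a fairly recent 2.x driver version:

import com.datastax.driver.core.HostDistance;
import com.datastax.driver.core.PoolingOptions;

// Placeholder values: raising them adds more load on nodes that are
// already busy, so they need to be tuned carefully.
PoolingOptions poolingOptions = new PoolingOptions()
        .setCoreConnectionsPerHost(HostDistance.LOCAL, 4)
        .setMaxConnectionsPerHost(HostDistance.LOCAL, 10)
        // How long a request waits for a free connection before failing with
        // "Timeout while trying to acquire available connection".
        .setPoolTimeoutMillis(10000);

cluster = com.datastax.driver.core.Cluster.builder()
        .addContactPoints(contactPoints)
        .withPoolingOptions(poolingOptions)
        // ... credentials, query options, policies and port as above ...
        .build();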

Does anyone know how I can solve this problem? Or does anyone at least have a hint about how to deal with this situation?

Update: If I get the error messages with e.getErrors() I obtain:

/172.31.7.243:9042=com.datastax.driver.core.OperationTimedOutException: [/172.31.7.243:9042] Operation timed out, /172.31.7.244:9042=com.datastax.driver.core.OperationTimedOutException: [/172.31.7.244:9042] Operation timed out, /172.31.7.245:9042=com.datastax.driver.core.OperationTimedOutException: [/172.31.7.245:9042] Operation timed out, /172.31.7.246:9042=com.datastax.driver.core.OperationTimedOutException: [/172.31.7.246:9042] Operation timed out, /172.31.7.247:9042=com.datastax.driver.core.OperationTimedOutException: [/172.31.7.247:9042] Operation timed out}
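
For completeness, this is roughly how I dump the per-host errors (session, statement and LOG stand in for the surrounding application code):

import java.util.Map;
import com.datastax.driver.core.exceptions.NoHostAvailableException;

try {
    session.execute(statement);   // whichever query is failing
} catch (NoHostAvailableException e) {
    // The exception message only prints the first three hosts;
    // getErrors() holds the failure cause for every host that was tried.
    for (Map.Entry<?, Throwable> entry : e.getErrors().entrySet()) {
        LOG.error("Host " + entry.getKey() + " failed: " + entry.getValue());
    }
}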

UPDATE:

DELETE FROM keyspace_name.search WHERE idline1 = 837 and idline2 = 841 and partid = 8558 and id = 18c04c20-8a3a-11e5-9e20-0025905a2ab2;

CREATE TABLE search (
    idline1 bigint,
    idline2 bigint,
    partid int,
    id uuid,
    field3 int,
    field4 int,
    field5 int,
    field6 int,
    field7 int,
    field8 int,
    field9 double,
    field10 bigint,
    field11 bigint,
    field12 bigint,
    field13 boolean,
    field14 boolean,
    field15 int,
    field16 bigint,
    field17 int,
    field18 int,
    field19 int,
    field20 int,
    field21 uuid,
    field22 boolean,
    PRIMARY KEY ((idline1, idline2, partid), id)
) WITH bloom_filter_fp_chance=0.010000
    AND caching='KEYS_ONLY'
    AND comment='Table with the snp between lines'
    AND dclocal_read_repair_chance=0.000000
    AND gc_grace_seconds=0
    AND index_interval=128
    AND read_repair_chance=0.100000
    AND replicate_on_write='true'
    AND populate_io_cache_on_flush='false'
    AND default_time_to_live=0
    AND speculative_retry='99.0PERCENTILE'
    AND memtable_flush_period_in_ms=0
    AND compaction={'class': 'SizeTieredCompactionStrategy'}
    AND compression={'sstable_compression': 'LZ4Compressor'};

CREATE INDEX search_partid ON search (partid);

CREATE INDEX search_field8 ON search (field8);

UPDATE (18-03-2016):

After the deletes start to be executed, I found that the CPU of some of the nodes increases a lot:

[screenshot: CPU usage graph showing the spike on the affected nodes]

I checked the processes on those nodes and only Cassandra is running, but it is consuming a lot of CPU. The rest of the nodes are using almost no CPU.

UPDATE (04-04-2016): I do not know if it is related. I checked the nodes with a lot of CPU (near 96%) and the GC activity remains at 1.6% (using only 3 GB of the 10 GB assigned).

Checking the thread pool stats:

nodetool tpstats

Pool Name                    Active   Pending      Completed   Blocked  All time blocked
ReadStage                         0         0       20042001         0                 0
RequestResponseStage              0         0      149365845         0                 0
MutationStage                    32    117720      181498576         0                 0
ReadRepairStage                   0         0         799373         0                 0
ReplicateOnWriteStage             0         0       13624173         0                 0
GossipStage                       0         0        5580503         0                 0
CacheCleanupExecutor              0         0              0         0                 0
AntiEntropyStage                  0         0          32173         0                 0
MigrationStage                    0         0              9         0                 0
MemtablePostFlusher               0         0          45044         0                 0
MemoryMeter                       0         0           9553         0                 0
FlushWriter                       0         0           9425         0                18
ValidationExecutor                0         0          15980         0                 0
MiscStage                         0         0              0         0                 0
PendingRangeCalculator            0         0              7         0                 0
CompactionExecutor                0         0        1293147         0                 0
commitlog_archiver                0         0              0         0                 0
InternalResponseStage             0         0              0         0                 0
HintedHandoff                     0         0            273         0                 0

Message type           Dropped
RANGE_SLICE                  0
READ_REPAIR                  0
PAGED_RANGE                  0
BINARY                       0
READ                         0
MUTATION                     0
_TRACE                       0
REQUEST_RESPONSE             0
COUNTER_MUTATION             0

I notice that the pending MutationStage count keeps growing while the active value remains the same. Could this be the problem?

Upvotes: 0

Views: 1255

Answers (1)

HashtagMarkus

Reputation: 1661

I see two problems with your data model.

  • You use two secondary indexes, and one of them is on a column that is part of the partition key. I don't know exactly how Cassandra behaves in this case. The worst case is that, even when you use the complete partition key (as you do in your example delete), Cassandra still does a lookup in the secondary index. In that case this would mean a full cluster scan, because secondary indexes are only stored per partition; since only part of the partition key is indexed, Cassandra does not know on which partition the index information lies. This behavior would at least explain the timeouts (see the sketch at the end of this answer for a way to test it by dropping the index).

  • You said you delete a lot of rows in a specific partition. That is also a problem. For each deletion, Cassandra creates a tombstone. The more tombstones there are, the slower reads become. This will sooner or later lead to timeouts or exceptions (I believe Cassandra logs a warning when 1,000 tombstones are scanned and throws an exception when 10,000 tombstones are reached). By the way, these tombstones are also created in the secondary index. By default, Cassandra removes tombstones after gc_grace_seconds (10 days by default) when a compaction is performed. You can change this property per table. More information on these table properties can be found here: Table Properties

I believe the first point could be the reason for the timeouts.
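
If you want to test this, here is a rough sketch using the same Java driver you already use. It assumes a Session obtained from your Cluster; the keyspace, table and index names come from your post, everything else is just illustration, not a drop-in fix:

// Point 1: drop the secondary index on the partition-key column and re-run the
// deletes. If the timeouts disappear, the index lookup was the problem.
// (Re-creating the index later forces a rebuild, which takes time on big tables.)
session.execute("DROP INDEX keyspace_name.search_partid");

// Point 2: if you are clearing out a whole partition anyway, delete it with one
// statement. This writes a single partition tombstone instead of one tombstone
// per row, which keeps the tombstone count (and read latency) down.
session.execute(
    "DELETE FROM keyspace_name.search WHERE idline1 = 837 AND idline2 = 841 AND partid = 8558");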

Upvotes: 0
