Cassandra NoHostAvailableException when deletes are executed with cqlsh

Question

We have a cluster with 7 nodes and we use the datastax java driver to connect to the cluster. The problem is that I am getting constant NoHostAvailableException like this:

Caused by: com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: /172.31.7.243:9042 (com.datastax.driver.core.exceptions.DriverException: Timeout while trying to acquire available connection (you may want to increase the driver number of per-host connections)), /172.31.7.245:9042 (com.datastax.driver.core.exceptions.DriverException: Timeout while trying to acquire available connection (you may want to increase the driver number of per-host connections)), /172.31.7.246:9042 (com.datastax.driver.core.exceptions.DriverException: Timeout while trying to acquire available connection (you may want to increase the driver number of per-host connections)), /172.31.7.247:9042, /172.31.7.232:9042, /172.31.7.233:9042, /172.31.7.244:9042 [only showing errors of first 3 hosts, use getErrors() for more details])

All the nodes are up:

UN  172.31.7.244  152.21 GB  256     14.5%  58abea69-e7ba-4e57-9609-24f3673a7e58  RAC1
UN  172.31.7.245  168.4 GB   256     14.5%  bc11b4f0-cf96-4ca5-9a3e-33cc2b92a752  RAC1
UN  172.31.7.246  177.71 GB  256     13.7%  8dc7bb3d-38f7-49b9-b8db-a622cc80346c  RAC1
UN  172.31.7.247  158.57 GB  256     14.1%  94022081-a563-4042-81ab-75ffe4d13194  RAC1
UN  172.31.7.243  176.83 GB  256     14.6%  0dda3410-db58-42f2-9351-068bdf68f530  RAC1
UN  172.31.7.233  159 GB     256     13.6%  01e013fb-2f57-44fb-b3c5-fd89d705bfdd  RAC1
UN  172.31.7.232  166.05 GB  256     15.0%  4d009603-faa9-4add-b3a2-fe24ec16a7c1  RAC1

but two of them have high cpu load, especially the 232 because I am running a lot of deletes using cqlsh in that node.

I know that deletes generate tombstones, but with 7 nodes in the cluster I do not think is normal that all the host are not accesible.

Our configuration for the java connection is:

com.datastax.driver.core.Cluster cluster = null;
        //Get contact points
        String[] contactPoints=this.environment.getRequiredProperty(CASSANDRA_CLUSTER_URL).split(",");
        cluster = com.datastax.driver.core.Cluster.builder()
            .addContactPoints(contactPoints))
            .withCredentials(this.environment.getRequiredProperty(CASSANDRA_CLUSTER_USERNAME), 
                this.environment.getRequiredProperty(CASSANDRA_CLUSTER_PASSWORD))
                .withQueryOptions(new QueryOptions()
                .setConsistencyLevel(ConsistencyLevel.QUORUM))
                .withLoadBalancingPolicy(new TokenAwarePolicy(new RoundRobinPolicy()))
                .withRetryPolicy(new LoggingRetryPolicy(DowngradingConsistencyRetryPolicy.INSTANCE))
                .withPort(Integer.parseInt(this.environment.getRequiredProperty(CASSANDRA_CLUSTER_PORT)))
                .build();

        Metadata metadata = cluster.getMetadata();
        for ( Host host : metadata.getAllHosts() ) {
            LOG.info("Datacenter: "+host.getDatacenter()+"; Host: "+host.getAddress()+"; DC: "+host.getDatacenter()+"
");
        }

and the contact points are:

172.31.7.244,172.31.7.243,172.31.7.245,172.31.7.246,172.31.7.247

Anyone knows how I can solve this problem? Or at least have anyone some hint about how to deal with this situation?

Update: If I get the error messages withe.getErrors() I obtain:

/172.31.7.243:9042=com.datastax.driver.core.OperationTimedOutException: [/172.31.7.243:9042] Operation timed out, /172.31.7.244:9042=com.datastax.driver.core.OperationTimedOutException: [/172.31.7.244:9042] Operation timed out, /172.31.7.245:9042=com.datastax.driver.core.OperationTimedOutException: [/172.31.7.245:9042] Operation timed out, /172.31.7.246:9042=com.datastax.driver.core.OperationTimedOutException: [/172.31.7.246:9042] Operation timed out, /172.31.7.247:9042=com.datastax.driver.core.OperationTimedOutException: [/172.31.7.247:9042] Operation timed out}

UPDATE:

The replication factor of the keyspace is 3.
For the deletes Im running them using different files with the cql queries:

cqlsh ip_node_1 -f script-1.duplicates cqlsh ip_node_1 -f script-2.duplicates cqlsh ip_node_1 -f script-3.duplicates ...
I am not specifying any consistency level, so is using the default one which is ONE.
Each of the previous files contain deletes like this:

DELETE FROM keyspace_name.search WHERE idline1 = 837 and idline2 = 841 and partid = 8558 and id = 18c04c20-8a3a-11e5-9e20-0025905a2ab2;

And the column family is:

CREATE TABLE search ( idline1 bigint, idline2 bigint, partid int, id uuid, field3 int, field4 int, field5 int, field6 int, field7 int, field8 int, field9 double, field10 bigint, field11 bigint, field12 bigint, field13 boolean, field14 boolean, field15 int, field16 bigint, field17 int, field18 int, field19 int, field20 int, field21 uuid, field22 boolean, PRIMARY KEY ((idline1, idline2, partid), id) ) WITH bloom_filter_fp_chance=0.010000 AND caching='KEYS_ONLY' AND comment='Table with the snp between lines' AND dclocal_read_repair_chance=0.000000 AND gc_grace_seconds=0 AND index_interval=128 AND read_repair_chance=0.100000 AND replicate_on_write='true' AND populate_io_cache_on_flush='false' AND default_time_to_live=0 AND speculative_retry='99.0PERCENTILE' AND memtable_flush_period_in_ms=0 AND compaction={'class': 'SizeTieredCompactionStrategy'} AND compression={'sstable_compression': 'LZ4Compressor'};

CREATE INDEX search_partid ON search (partid);

CREATE INDEX search_field8 ON search (field8);

UPDATE (18-03-2016):

After the deletes start to be executed I found the cpu of some of the nodes increases a lot:

I check the processes on that nodes and only cassandra is running but consuming a lot of cpu. The rest of the nodes are not using almost cpu.

UPDATE (04-04-2016): I do not know if it is related. I check the nodes which a lot of CPU (near 96%) and th gc activity remains on 1.6% (using only 3 GB from the 10 which have assigned).

Checing the thread pool stats:

nodetool tpstats Pool Name Active Pending Completed Blocked All time blocked ReadStage 0 0 20042001 0 0 RequestResponseStage 0 0 149365845 0 0 MutationStage 32 117720 181498576 0 0 ReadRepairStage 0 0 799373 0 0 ReplicateOnWriteStage 0 0 13624173 0 0 GossipStage 0 0 5580503 0 0 CacheCleanupExecutor 0 0 0 0 0 AntiEntropyStage 0 0 32173 0 0 MigrationStage 0 0 9 0 0 MemtablePostFlusher 0 0 45044 0 0 MemoryMeter 0 0 9553 0 0 FlushWriter 0 0 9425 0 18 ValidationExecutor 0 0 15980 0 0 MiscStage 0 0 0 0 0 PendingRangeCalculator 0 0 7 0 0 CompactionExecutor 0 0 1293147 0 0 commitlog_archiver 0 0 0 0 0 InternalResponseStage 0 0 0 0 0 HintedHandoff 0 0 273 0 0

Message type Dropped RANGE_SLICE 0 READ_REPAIR 0 PAGED_RANGE 0 BINARY 0 READ 0 MUTATION 0 _TRACE 0 REQUEST_RESPONSE 0 COUNTER_MUTATION 0

I realize that the pending mutation stages are growing but the active value remain the same, could be this the problem?

Cassandra NoHostAvailableException when deletes are executed with cqlsh

Answers (1)

Related Questions