Reputation: 1192
We have a cluster with 7 nodes and we use the DataStax Java driver to connect to it. The problem is that I am constantly getting NoHostAvailableException, like this:
Caused by: com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: /172.31.7.243:9042 (com.datastax.driver.core.exceptions.DriverException: Timeout while trying to acquire available connection (you may want to increase the driver number of per-host connections)), /172.31.7.245:9042 (com.datastax.driver.core.exceptions.DriverException: Timeout while trying to acquire available connection (you may want to increase the driver number of per-host connections)), /172.31.7.246:9042 (com.datastax.driver.core.exceptions.DriverException: Timeout while trying to acquire available connection (you may want to increase the driver number of per-host connections)), /172.31.7.247:9042, /172.31.7.232:9042, /172.31.7.233:9042, /172.31.7.244:9042 [only showing errors of first 3 hosts, use getErrors() for more details])
All the nodes are up:
UN 172.31.7.244 152.21 GB 256 14.5% 58abea69-e7ba-4e57-9609-24f3673a7e58 RAC1
UN 172.31.7.245 168.4 GB 256 14.5% bc11b4f0-cf96-4ca5-9a3e-33cc2b92a752 RAC1
UN 172.31.7.246 177.71 GB 256 13.7% 8dc7bb3d-38f7-49b9-b8db-a622cc80346c RAC1
UN 172.31.7.247 158.57 GB 256 14.1% 94022081-a563-4042-81ab-75ffe4d13194 RAC1
UN 172.31.7.243 176.83 GB 256 14.6% 0dda3410-db58-42f2-9351-068bdf68f530 RAC1
UN 172.31.7.233 159 GB 256 13.6% 01e013fb-2f57-44fb-b3c5-fd89d705bfdd RAC1
UN 172.31.7.232 166.05 GB 256 15.0% 4d009603-faa9-4add-b3a2-fe24ec16a7c1 RAC1
but two of them have a high CPU load, especially the 232, because I am running a lot of deletes using cqlsh on that node.
I know that deletes generate tombstones, but with 7 nodes in the cluster I do not think it is normal that none of the hosts are accessible.
Our configuration for the java connection is:
com.datastax.driver.core.Cluster cluster = null;

// Get contact points from configuration
String[] contactPoints = this.environment.getRequiredProperty(CASSANDRA_CLUSTER_URL).split(",");

cluster = com.datastax.driver.core.Cluster.builder()
        .addContactPoints(contactPoints)
        .withCredentials(this.environment.getRequiredProperty(CASSANDRA_CLUSTER_USERNAME),
                this.environment.getRequiredProperty(CASSANDRA_CLUSTER_PASSWORD))
        .withQueryOptions(new QueryOptions()
                .setConsistencyLevel(ConsistencyLevel.QUORUM))
        .withLoadBalancingPolicy(new TokenAwarePolicy(new RoundRobinPolicy()))
        .withRetryPolicy(new LoggingRetryPolicy(DowngradingConsistencyRetryPolicy.INSTANCE))
        .withPort(Integer.parseInt(this.environment.getRequiredProperty(CASSANDRA_CLUSTER_PORT)))
        .build();

// Log the hosts and datacenters the driver discovered
Metadata metadata = cluster.getMetadata();
for (Host host : metadata.getAllHosts()) {
    LOG.info("Datacenter: " + host.getDatacenter() + "; Host: " + host.getAddress() + "\n");
}
and the contact points are:
172.31.7.244,172.31.7.243,172.31.7.245,172.31.7.246,172.31.7.247
Does anyone know how I can solve this problem? Or does anyone at least have a hint about how to deal with this situation?
Update: If I get the error messages with e.getErrors() I obtain:
/172.31.7.243:9042=com.datastax.driver.core.OperationTimedOutException: [/172.31.7.243:9042] Operation timed out
/172.31.7.244:9042=com.datastax.driver.core.OperationTimedOutException: [/172.31.7.244:9042] Operation timed out
/172.31.7.245:9042=com.datastax.driver.core.OperationTimedOutException: [/172.31.7.245:9042] Operation timed out
/172.31.7.246:9042=com.datastax.driver.core.OperationTimedOutException: [/172.31.7.246:9042] Operation timed out
/172.31.7.247:9042=com.datastax.driver.core.OperationTimedOutException: [/172.31.7.247:9042] Operation timed out
UPDATE:
For the deletes, I am running them from different files containing the CQL queries:
cqlsh ip_node_1 -f script-1.duplicates
cqlsh ip_node_1 -f script-2.duplicates
cqlsh ip_node_1 -f script-3.duplicates
...
I am not specifying any consistency level, so it is using the default one, which is ONE.
Each of the previous files contains deletes like this:
DELETE FROM keyspace_name.search WHERE idline1 = 837 and idline2 = 841 and partid = 8558 and id = 18c04c20-8a3a-11e5-9e20-0025905a2ab2;
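For reference, I understand a consistency level could be set explicitly at the top of each script instead of relying on the default; a minimal sketch of what such a file could look like (I am not doing this at the moment):

-- cqlsh applies a CONSISTENCY command to the statements that follow it in the file
CONSISTENCY QUORUM;
DELETE FROM keyspace_name.search WHERE idline1 = 837 and idline2 = 841 and partid = 8558 and id = 18c04c20-8a3a-11e5-9e20-0025905a2ab2;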
The table the deletes run against is defined as:
CREATE TABLE search (
    idline1 bigint, idline2 bigint, partid int, id uuid,
    field3 int, field4 int, field5 int, field6 int, field7 int, field8 int,
    field9 double, field10 bigint, field11 bigint, field12 bigint,
    field13 boolean, field14 boolean, field15 int, field16 bigint,
    field17 int, field18 int, field19 int, field20 int,
    field21 uuid, field22 boolean,
    PRIMARY KEY ((idline1, idline2, partid), id)
) WITH bloom_filter_fp_chance=0.010000
    AND caching='KEYS_ONLY'
    AND comment='Table with the snp between lines'
    AND dclocal_read_repair_chance=0.000000
    AND gc_grace_seconds=0
    AND index_interval=128
    AND read_repair_chance=0.100000
    AND replicate_on_write='true'
    AND populate_io_cache_on_flush='false'
    AND default_time_to_live=0
    AND speculative_retry='99.0PERCENTILE'
    AND memtable_flush_period_in_ms=0
    AND compaction={'class': 'SizeTieredCompactionStrategy'}
    AND compression={'sstable_compression': 'LZ4Compressor'};
CREATE INDEX search_partid ON search (partid);
CREATE INDEX search_field8 ON search (field8);
UPDATE (18-03-2016):
After the deletes start to be executed, I found that the CPU usage of some of the nodes increases a lot.
I checked the processes on those nodes and only Cassandra is running, but it is consuming a lot of CPU. The rest of the nodes are barely using any CPU.
UPDATE (04-04-2016): I do not know if it is related. I checked the nodes with a lot of CPU (near 96%) and the GC activity remains at 1.6% (using only 3 GB of the 10 GB it has assigned).
Checking the thread pool stats:
nodetool tpstats

Pool Name                    Active   Pending   Completed   Blocked  All time blocked
ReadStage                         0         0    20042001         0                 0
RequestResponseStage              0         0   149365845         0                 0
MutationStage                    32    117720   181498576         0                 0
ReadRepairStage                   0         0      799373         0                 0
ReplicateOnWriteStage             0         0    13624173         0                 0
GossipStage                       0         0     5580503         0                 0
CacheCleanupExecutor              0         0           0         0                 0
AntiEntropyStage                  0         0       32173         0                 0
MigrationStage                    0         0           9         0                 0
MemtablePostFlusher               0         0       45044         0                 0
MemoryMeter                       0         0        9553         0                 0
FlushWriter                       0         0        9425         0                18
ValidationExecutor                0         0       15980         0                 0
MiscStage                         0         0           0         0                 0
PendingRangeCalculator            0         0           7         0                 0
CompactionExecutor                0         0     1293147         0                 0
commitlog_archiver                0         0           0         0                 0
InternalResponseStage             0         0           0         0                 0
HintedHandoff                     0         0         273         0                 0

Message type           Dropped
RANGE_SLICE                  0
READ_REPAIR                  0
PAGED_RANGE                  0
BINARY                       0
READ                         0
MUTATION                     0
_TRACE                       0
REQUEST_RESPONSE             0
COUNTER_MUTATION             0
I realize that the pending MutationStage tasks are growing, but the active value remains the same. Could this be the problem?
Upvotes: 0
Views: 1255
Reputation: 1661
I see two problems with your data model.
You use two secondary indexes. One is on a column that is part of the partition key. I don't know how Cassandra behaves in this case. The worst case is that, even if you use the complete partition key (like you do in your example delete), Cassandra still does a lookup in the secondary index. That would mean a full cluster scan, because secondary index entries are only stored per partition: since only a part of the partition key is indexed, Cassandra does not know on which partition the index information lies. This behavior would at least explain the timeouts.
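For illustration only: since partid is already part of the partition key ((idline1, idline2, partid)), one option, assuming nothing queries by partid alone, would be to drop that index so it no longer has to be consulted or maintained:

-- sketch: remove the secondary index on the partition-key column
DROP INDEX keyspace_name.search_partid;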
You said you delete a lot of rows in a specific partition. That is also a problem. For each deletion Cassandra creates a tombstone. The more tombstones there are, the slower reads become. This will sooner or later lead to timeouts or exceptions (I believe Cassandra writes warnings when 1,000 tombstones are reached and throws exceptions when 10,000 tombstones are reached). By the way, these tombstones are also created in the secondary index. By default Cassandra removes tombstones after gc_grace_seconds (10 days by default) when a compaction is performed. You can change this property per table. More information on these table properties can be found here: Table Properties
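As a sketch of how that property could be changed per table (the value is only an example; lowering it is only safe if repairs run more often than this window, otherwise deleted data can reappear):

-- example: make tombstones eligible for removal after 1 day instead of the 10-day default
ALTER TABLE keyspace_name.search WITH gc_grace_seconds = 86400;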
I believe the first point could be the reason for the timeouts.
Upvotes: 0