Reputation: 8360
I have a cassandra cluster of 3 nodes. There's one table storing approx 400M rows. I am hitting below select query:
SELECT * FROM table_1 WHERE vuid in ('abc','def','ghi');
Above is a sample query. In clause in our production environment has 1000 keys. Below is the table structure
CREATE TABLE dmp.user_profiles_9 (
vuid text PRIMARY KEY,
apnid text,
brand_model text,
first_seen timestamp,
ifa text,
last_seen timestamp,
msisdn text,
total_day_count int,
total_usage_count int,
user_type text
) WITH bloom_filter_fp_chance = 0.01
AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
AND comment = ''
AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32', 'min_threshold': '4'}
AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND crc_check_chance = 1.0
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99PERCENTILE';
With 1000 keys in "in" clause query takes upwards of 5 seconds. We need to keep updating above data set daily. This job involves full table scan. And in order to finish as quick as possible every read should be faster. What could be done in above case?
Upvotes: 0
Views: 1008
Reputation: 139
One attempt you could try would be to split out the IN clause to multiple queries, you could execute this asynchronously and return the individual results into a complete result set.
An example, and some more discussion, of this can be found here.
This will prevent only one node from coordinating, allowing the load to be properly spread out to the other nodes. If you make this change, it would also benefit from a TokenAware load balancing policy, so that the node with the data you're looking up is hit each time.
Upvotes: 2