Reputation: 1023
I am working on a single node. I have the following table to store a list of documents:
CREATE TABLE my_keyspace.document (
status text,
date timestamp,
doc_id text,
raw_content text,
title text,
PRIMARY KEY (status, date, doc_id)
) WITH CLUSTERING ORDER BY (date ASC, doc_id ASC)
AND bloom_filter_fp_chance = 0.01
AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
AND comment = ''
AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32', 'min_threshold': '4'}
AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND crc_check_chance = 1.0
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 0
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99PERCENTILE';
CREATE INDEX doc_id_idx ON my_keyspace.document (doc_id);
I am doing a lot of queries like:
SELECT * FROM my_keyspace.document WHERE status='PROCESSED' AND date>=start_date AND date<=end_date;
For some reason it is very slow. At first, the warnings I got were these:
[2016-07-26 18:10:46] {cassandra.protocol:378} WARNING - Server warning: Read 5000 live rows and 19999 tombstone cells for query SELECT * FROM my_keyspace.document WHERE token(status) >= token(PROCESSED) AND token(status) <= token(PROCESSED) AND date >= 2016-07-08 02:00+0200 AND date <= 2016-07-23 01:59+0200 LIMIT 5000 (see tombstone_warn_threshold)
[2016-07-26 18:10:52] {cassandra.protocol:378} WARNING - Server warning: Read 5000 live rows and 19999 tombstone cells for query SELECT * FROM my_keyspace.document WHERE token(status) >= token(PROCESSED) AND token(status) <= token(PROCESSED) AND date >= 2016-07-08 02:00+0200 AND date <= 2016-07-23 01:59+0200 LIMIT 5000 (see tombstone_warn_threshold)
Thinking the issue was linked to having too many tombstones, I ran:
ALTER TABLE my_keyspace.document WITH gc_grace_seconds = 0;
and then:
nodetool compact my_keyspace document
Now I don't have any warnings, but the queries are still very slow and often time out. No message about the timeouts appears in any log. I have roughly 200k documents, distributed over a 20-day period; about 4,500 documents have status='PROCESSED' each day. The response time varies with the date range: about 3 seconds for a one-day range, 15 seconds for 4 days, and a timeout for 2 weeks. Also, I disabled swap. The Cassandra version I am using is 3.5.
Recently I've noticed that selecting the precise columns I need instead of * improves the response time a bit, but the system is still too slow.
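For example, a projection that skips the large raw_content column (the date literals here are just an illustration):
SELECT doc_id, title, date FROM my_keyspace.document
WHERE status='PROCESSED' AND date>='2016-07-08 00:00+0200' AND date<='2016-07-09 00:00+0200';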
EDIT: Computing partition size as proposed by Reveka
So, following the formula Nv = Nr × (Nc − Npk − Ns) + Ns (with Nr = 4,500 documents/day × 20 days = 90,000 rows and no static columns):
the number of values is 90,000 × (19 − 3) = 1,440,000.
For the size of the partition, I arrived at an estimate of about 1.2 GB.
This might be a bit big. But how can I modify my partition key so that I can still do the same range queries while having smaller partitions? I could use a composite partition key containing the status and the day extracted from date, but wouldn't I then have to specify the day before being able to query by range:
SELECT * FROM my_keyspace.document WHERE status='PROCESSED' AND day='someday' AND date>='start_date' AND date<='end_date';
That forces me to do one query per day.
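For reference, here is the kind of bucketed table I have in mind (document_by_day and the day column are names I made up; day would be filled in by my application at write time):
CREATE TABLE my_keyspace.document_by_day (
    status text,
    day text,           -- e.g. '2016-07-12', derived from date when inserting
    date timestamp,
    doc_id text,
    raw_content text,
    title text,
    PRIMARY KEY ((status, day), date, doc_id)
) WITH CLUSTERING ORDER BY (date ASC, doc_id ASC);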
Upvotes: 1
Views: 1064
Reputation: 101
I see that your primary key consists of status, date and doc_id, and you only use status as your partition key. That means that all the documents with the same status, regardless of date, are put in the same partition. I suspect that is a lot of data for one partition: Cassandra works well with partitions up to about 100MB (or a couple of hundred MB in later versions), see here. The DataStax DS220 course (it is free, you just need to create an account) has a video that shows you how to calculate your partition size. You can post the results of your analysis so we can help you further. :)
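By the way, you can also get a rough empirical check from Cassandra itself, since cfstats reports per-table partition size statistics:
nodetool cfstats my_keyspace.document
# look at "Compacted partition maximum bytes" and "Compacted partition mean bytes"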
EDIT: After the size analysis
You will have to partition by date in order to have smaller partitions. That means that you will no longer be able to query a whole range with a single query. A workaround is to do multiple queries based on the range you want. For example: if you want to query the range 12 August to 14 August, you split it by day and do three queries, one for 12 August, one for 13 and one for 14. Again though, if your range is big you will end up retrieving GBs of data. I do not know your use case, but I am going to guess that you don't need GBs worth of files every time you do a date range query. Can you give me more info on your use case (a.k.a. what do you want to do)?
PS: I can't write comments yet, so I can only advise you through this answer.
Upvotes: 3