Reputation: 1023
I am working on a single node. I have the following table to store a list of documents:
CREATE TABLE my_keyspace.document (
status text,
date timestamp,
doc_id text,
raw_content text,
title text,
PRIMARY KEY (status, date, doc_id)
) WITH CLUSTERING ORDER BY (date ASC, doc_id ASC)
AND bloom_filter_fp_chance = 0.01
AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
AND comment = ''
AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32', 'min_threshold': '4'}
AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND crc_check_chance = 1.0
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 0
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99PERCENTILE';
CREATE INDEX doc_id_idx ON my_keyspace.document (doc_id);
I am doing a lot of queries like:
SELECT * FROM my_keyspace.document WHERE status='PROCESSED' AND date>=start_date AND date<=end_date;
For some reason it is very slow. At first, the warnings I got were these:
[2016-07-26 18:10:46] {cassandra.protocol:378} WARNING - Server warning: Read 5000 live rows and 19999 tombstone cells for query SELECT * FROM my_keyspace.document WHERE token(status) >= token(PROCESSED) AND token(status) <= token(PROCESSED) AND date >= 2016-07-08 02:00+0200 AND date <= 2016-07-23 01:59+0200 LIMIT 5000 (see tombstone_warn_threshold)
[2016-07-26 18:10:52] {cassandra.protocol:378} WARNING - Server warning: Read 5000 live rows and 19999 tombstone cells for query SELECT * FROM my_keyspace.document WHERE token(status) >= token(PROCESSED) AND token(status) <= token(PROCESSED) AND date >= 2016-07-08 02:00+0200 AND date <= 2016-07-23 01:59+0200 LIMIT 5000 (see tombstone_warn_threshold)
Thinking the issue was linked to having too many tombstones, I ran:
ALTER TABLE my_keyspace.document WITH gc_grace_seconds = 0;
and then:
nodetool compact my_keyspace document
Now I don't have any warnings, but the queries are still very slow and often time out. No message about the timeouts appears in any log. I have roughly 200k documents, distributed over a 20-day period; about 4,500 documents have status='PROCESSED' each day. The response time varies with the date range: about 3 seconds for a one-day range, 15 seconds for 4 days, and a timeout for 2 weeks. Also, I disabled swap. The Cassandra version I am using is 3.5.
Recently I've noticed that selecting the precise columns I need instead of * improves the response time a bit, but the system is still too slow.
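For example, a projection that skips the large raw_content column (the date literals here are just an illustration):
SELECT doc_id, title, date FROM my_keyspace.document
WHERE status='PROCESSED' AND date>='2016-07-08 00:00+0200' AND date<='2016-07-09 00:00+0200';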
EDIT: Computing partition size as proposed by Reveka
So, following the formula Nv = Nr × (Nc − Npk − Ns) + Ns (with Nr = 4,500 documents/day × 20 days = 90,000 rows and no static columns):
the number of values is 90,000 × (19 − 3) = 1,440,000.
For the size of the partition, I arrived at an estimate of about 1.2 GB.
This might be a bit big. But how can I modify my partition key so that I can still do the same range queries while having smaller partitions? I could use a composite partition key containing the status and the day extracted from date, but wouldn't I then have to specify the day before being able to query by range:
SELECT * FROM my_keyspace.document WHERE status='PROCESSED' AND day='someday' AND date>='start_date' AND date<='end_date';
That forces me to do one query per day.
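For reference, here is the kind of bucketed table I have in mind (document_by_day and the day column are names I made up; day would be filled in by my application at write time):
CREATE TABLE my_keyspace.document_by_day (
    status text,
    day text,           -- e.g. '2016-07-12', derived from date when inserting
    date timestamp,
    doc_id text,
    raw_content text,
    title text,
    PRIMARY KEY ((status, day), date, doc_id)
) WITH CLUSTERING ORDER BY (date ASC, doc_id ASC);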
Upvotes: 1
Views: 1064
Reputation: 101
I see that your primary key consists of status, date and doc_id, and you only use status as your partition key. That means that all the documents with the same status, regardless of date, are put in the same partition. I suspect that is a lot of data for one partition: Cassandra works well with partitions up to about 100MB (or a couple of hundred MB in later versions), see here. The DataStax DS220 course (it is free, you just need to create an account) has a video that shows you how to calculate your partition size. You can post the results of your analysis so we can help you further. :)
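By the way, you can also get a rough empirical check from Cassandra itself, since cfstats reports per-table partition size statistics:
nodetool cfstats my_keyspace.document
# look at "Compacted partition maximum bytes" and "Compacted partition mean bytes"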
EDIT: After the size analysis
You will have to partition by date in order to have smaller partitions. That means that you will no longer be able to query a whole range with a single query. A workaround is to do multiple queries based on the range you want. For example: if you want to query the range 12 August to 14 August, you split it by day and do three queries, one for 12 August, one for 13 and one for 14. Again though, if your range is big you will end up retrieving GBs of data. I do not know your use case, but I am going to guess that you don't need GBs worth of files every time you do a date range query. Can you give me more info on your use case (a.k.a. what do you want to do)?
PS: I can't write comments yet, so I can only advise you through this answer.
Upvotes: 3