Reputation: 893
I have a Cassandra schema similar to this:
CREATE TABLE table1 (a1 text,
a2 text,
b1 text,
b2 text,
c1 text,
c2 text,
PRIMARY KEY ( (a1, a2), b1, b2) )
);
This is for time series data. a1 is my id and a2 is my time bucket.
QUESTION: How do I properly select an entire time bucket?
I can select by a1 or a2. But if I select by a1 AND a2 the request times out.
SELECT * FROM table1 WHERE a1 = 'id1' AND a2 = 'time_bucket37' LIMIT 100 ALLOW FILTERING;
Result: Request did not complete within rpc_timeout.
Here is my::
cqlsh:testmulti1> DESCRIBE TABLE datasimple;
CREATE TABLE datasimple (
dc_name text,
time_bucket text,
partition int,
perf_name text,
vm_name text,
measurement_time timestamp,
perf_data int,
platform text,
PRIMARY KEY ((dc_name, time_bucket, partition), perf_name, vm_name, measurement_time)
) WITH
bloom_filter_fp_chance=0.010000 AND
caching='KEYS_ONLY' AND
comment='' AND
dclocal_read_repair_chance=0.000000 AND
gc_grace_seconds=864000 AND
index_interval=128 AND
read_repair_chance=0.100000 AND
replicate_on_write='true' AND
populate_io_cache_on_flush='false' AND
default_time_to_live=0 AND
speculative_retry='99.0PERCENTILE' AND
memtable_flush_period_in_ms=0 AND
compaction={'class': 'SizeTieredCompactionStrategy'} AND
compression={'sstable_compression': 'LZ4Compressor'};
CREATE INDEX datasimple_dc_name_idx_10 ON datasimple (dc_name);
CREATE INDEX datasimple_time_bucket_idx_10 ON datasimple (time_bucket);
CREATE INDEX datasimple_perf_name_idx_10 ON datasimple (perf_name);
CREATE INDEX datasimple_vm_name_idx_10 ON datasimple (vm_name);
CREATE INDEX datasimple_platform_idx_10 ON datasimple (platform);
I am trying to:
SELECT *
FROM DataSimple
WHERE dc_name = 'dc1'
AND time_bucket = '2014031922'
AND vm_name = 'vm1' LIMIT 1000;
Upvotes: 2
Views: 2510
Reputation: 57748
SELECT *
FROM DataSimple
WHERE dc_name = 'dc1'
AND time_bucket = '2014031922'
AND vm_name = 'vm1'
LIMIT 1000;
That's not going to work. Your current PRIMARY KEY
is:
PRIMARY KEY ((dc_name, time_bucket, partition),
perf_name, vm_name, measurement_time)
Try to think of using a compound PRIMARY KEY in Cassandra as a hierarchical path, starting from left to right. dc_name
, time_bucket
, and partition
are a part of your partitioning key, so you'll need to include each of those in your WHERE
to ensure that Cassandra can find where your data is stored.
All of the columns after that are clustering columns (perf_name
, vm_name
, and measurement_time
), which determine the sort order of your data on disk. Again, you can't pick and choose which of these to include. You need to search on them in the order that they are defined in your schema. So if you want to query by vm_name
, then you also need to make sure that the preceding column keys (perf_name
in this case) are also present in your WHERE
clause.
Basically to query this schema, add AND
s to your WHERE
clause for:
Upvotes: 5