Cassandra time series modeling

Question

I have a table like this.

> CREATE TABLE docyard.documents (
>     document_id text,
>     namespace text,
>     version_id text,
>     created_at timestamp,
>     path text,
>     attributes map<text, text>,  -- element types assumed; they are missing from the post
>     PRIMARY KEY (document_id, namespace, version_id, created_at)
> ) WITH CLUSTERING ORDER BY (namespace ASC, version_id ASC, created_at ASC)
>     AND bloom_filter_fp_chance = 0.01
>     AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
>     AND comment = ''
>     AND compaction = {'min_threshold': '4', 'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32'}
>     AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
>     AND dclocal_read_repair_chance = 0.1
>     AND default_time_to_live = 0
>     AND gc_grace_seconds = 864000
>     AND max_index_interval = 2048
>     AND memtable_flush_period_in_ms = 0
>     AND min_index_interval = 128
>     AND read_repair_chance = 0.0
>     AND speculative_retry = '99.0PERCENTILE';

I want to be able to run range queries like the following:

select * from documents where namespace = 'something' and created_at> 'some-value' order by created_at allow filtering;

select * from documents where namespace = 'something' and path = 'something' and created_at> 'some-value' order by created_at allow filtering;

I have not been able to make these queries work in any form. I tried secondary indexes as well. Can anyone please help?

I keep getting one error or another when trying to make it work.

Aaron · Accepted Answer

First of all, don't use secondary indexes or ALLOW FILTERING. With time series data, both will perform terribly over time.

To satisfy your first query, you will want to restructure your PRIMARY KEY and CLUSTERING ORDER like this:

PRIMARY KEY (namespace, created_at, document_id) ) 
WITH CLUSTERING ORDER BY (created_at DESC, document_id ASC);

This will allow for the following:

Partitioning by namespace.
Sorting by created_at in DESCending order (most-recent rows read first).
Uniqueness by document_id.

You will not need ALLOW FILTERING or ORDER BY in your query, as the necessary keys will be provided, and the results will already be sorted to your CLUSTERING ORDER.
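
As a minimal sketch, assuming the table keeps the name docyard.documents and uses the PRIMARY KEY shown above, the first query could then look something like this. The timestamp literal is only a placeholder value.

```cql
-- Assumes PRIMARY KEY (namespace, created_at, document_id)
-- with CLUSTERING ORDER BY (created_at DESC, document_id ASC).
-- The timestamp below is a hypothetical placeholder.
SELECT *
FROM docyard.documents
WHERE namespace = 'something'             -- equality on the partition key
  AND created_at > '2016-01-01 00:00:00+0000';  -- range on the first clustering column
```

The equality restriction selects a single partition and the range restriction reads a contiguous slice of it, so rows should come back newest first without ORDER BY.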

For your second query, I would create an additional query table. This is because in Cassandra, you need to model your tables to suit your queries. You may end up with several query tables for the same data, and that's OK.

CREATE TABLE docyardbypath.documents (
  document_id text,
  namespace text,
  version_id text,
  created_at timestamp,
  path text,
  attributes map<text, text>,  -- element types assumed; they are missing from the post
  PRIMARY KEY ((namespace, path), created_at, document_id)
) WITH CLUSTERING ORDER BY (created_at DESC, document_id ASC);

This will:

Partition by both namespace and path.
Allow rows within unique combinations of namespace and path to be sorted according to your CLUSTERING ORDER.

Again, you should not need ALLOW FILTERING or ORDER BY in your query.
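
A sketch of the second query against this table, under the same assumptions as before (the table name is taken as written above, and the timestamp literal is a placeholder):

```cql
-- Assumes PRIMARY KEY ((namespace, path), created_at, document_id).
-- Both partition key columns must be restricted by equality.
-- The timestamp below is a hypothetical placeholder.
SELECT *
FROM docyardbypath.documents
WHERE namespace = 'something'
  AND path = 'something'
  AND created_at > '2016-01-01 00:00:00+0000';
```

Because this is a separate query table, each document would need to be written to both tables for the two queries to return the same data.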
