Santosh Tulasiram

Reputation: 198

Cassandra time series modeling

I have a table like this:

> CREATE TABLE docyard.documents (
>     document_id text,
>     namespace text,
>     version_id text,
>     created_at timestamp,
>     path text,
>     attributes map<text, text>,
>     PRIMARY KEY (document_id, namespace, version_id, created_at)
> ) WITH CLUSTERING ORDER BY (namespace ASC, version_id ASC, created_at ASC)
>     AND bloom_filter_fp_chance = 0.01
>     AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
>     AND comment = ''
>     AND compaction = {'min_threshold': '4', 'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32'}
>     AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
>     AND dclocal_read_repair_chance = 0.1
>     AND default_time_to_live = 0
>     AND gc_grace_seconds = 864000
>     AND max_index_interval = 2048
>     AND memtable_flush_period_in_ms = 0
>     AND min_index_interval = 128
>     AND read_repair_chance = 0.0
>     AND speculative_retry = '99.0PERCENTILE';

I want to be able to run range queries like the following:

select * from documents where namespace = 'something' and created_at> 'some-value' order by created_at allow filtering;

select * from documents where namespace = 'something' and path = 'something' and created_at> 'some-value' order by created_at allow filtering;

I am not able to make these queries work in any manner. Tried secondary indexes as well. Can anyone please help?

I keep getting one error or another when trying to make it work.

Upvotes: 1

Views: 625

Answers (2)

Aaron

Reputation: 57798

First of all, don't use secondary indexes or ALLOW FILTERING. With timeseries data that will perform terribly over time.

To satisfy your first query, you will want to restructure your PRIMARY KEY and CLUSTERING ORDER like this:

PRIMARY KEY (namespace, created_at, document_id) ) 
WITH CLUSTERING ORDER BY (created_at DESC, document_id ASC);

This will allow for the following:

  • Partitioning by namespace.
  • Sorting by created_at in DESCending order (most-recent rows read first).
  • Uniqueness by document_id
  • You will not need ALLOW FILTERING or ORDER BY in your query, as the necessary keys will be provided, and the results will already be sorted to your CLUSTERING ORDER.
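
To make that concrete, here is a minimal sketch of the full table and the first query. The table name `documents_by_namespace` and the timestamp literal are placeholders I made up for illustration; the columns are copied from the question's schema.

```cql
-- Hypothetical table name; columns taken from the question's schema.
CREATE TABLE docyard.documents_by_namespace (
    namespace text,
    created_at timestamp,
    document_id text,
    version_id text,
    path text,
    attributes map<text, text>,
    PRIMARY KEY (namespace, created_at, document_id)
) WITH CLUSTERING ORDER BY (created_at DESC, document_id ASC);

-- Equality on the partition key, range on the first clustering column.
-- No ALLOW FILTERING or ORDER BY needed; rows come back newest first.
SELECT * FROM docyard.documents_by_namespace
WHERE namespace = 'something'
  AND created_at > '2016-01-01 00:00:00+0000';
```

Note that with this key, two rows sharing the same namespace, created_at and document_id overwrite each other. If several versions of a document can carry the same timestamp, you would probably want version_id as an extra clustering column.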

For your second query, I would create an additional query table. This is because in Cassandra, you need to model your tables to suit your queries. You may end up having several query tables for the same data, and that's OK.

CREATE TABLE docyardbypath.documents (
  document_id text,
  namespace text,
  version_id text,
  created_at timestamp,
  path text,
  attributes map<text, text>,
  PRIMARY KEY ((namespace, path), created_at, document_id) ) 
  WITH CLUSTERING ORDER BY (created_at DESC, document_id ASC);

This will:

  • Partition by both namespace and path.
  • Allow rows within unique combinations of namespace and path to be sorted according to your CLUSTERING ORDER.
  • Again, you should not need ALLOW FILTERING or ORDER BY in your query.
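
As a rough sketch, the second query against this table could look like the following. The string and timestamp values are placeholders, not values from the question.

```cql
-- Both partition key columns are restricted by equality,
-- then a range on the first clustering column.
SELECT * FROM docyardbypath.documents
WHERE namespace = 'something'
  AND path = 'something'
  AND created_at > '2016-01-01 00:00:00+0000';

-- If you want oldest-first for a particular read, the clustering order
-- can be reversed within a single partition:
SELECT * FROM docyardbypath.documents
WHERE namespace = 'something'
  AND path = 'something'
  AND created_at > '2016-01-01 00:00:00+0000'
ORDER BY created_at ASC;
```

If oldest-first is what you want most of the time, it is simpler to define the table with created_at ASC and skip ORDER BY altogether.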

Upvotes: 3

Myles Baker

Reputation: 3760

I think you need to review how data modeling works in Cassandra.

The first query can look like this:

select * from documents where namespace = 'something' and created_at > 'some_formatted_date'  and document_id='someid' and version_id='some_version' order by namespace, version_id, created_at allow filtering;

When querying a Cassandra table, you must:

  1. Restrict the partition key with equality, and restrict clustering columns in the order they are declared in the primary key
  2. Order by following the clustering order

Fixing the second query is straightforward. What are you trying to do? Cassandra is optimized for write performance. You may want to write this data into multiple tables for each group of queries you plan to run.
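
One way to write the same data into multiple tables is a logged batch, sketched below. This assumes two query tables along the lines of the other answer: a hypothetical `docyard.documents_by_namespace` keyed by namespace, and `docyardbypath.documents` keyed by (namespace, path). All values are made-up sample data.

```cql
-- A logged batch ensures that both inserts are eventually applied together.
BEGIN BATCH
  INSERT INTO docyard.documents_by_namespace
    (namespace, created_at, document_id, version_id, path, attributes)
  VALUES
    ('ns1', '2016-01-01 00:00:00+0000', 'doc-1', 'v1', '/a/b', {'owner': 'someone'});

  INSERT INTO docyardbypath.documents
    (namespace, path, created_at, document_id, version_id, attributes)
  VALUES
    ('ns1', '/a/b', '2016-01-01 00:00:00+0000', 'doc-1', 'v1', {'owner': 'someone'});
APPLY BATCH;
```

Batches that span partitions cost more than single writes, so keep them small and use them only to keep the query tables in sync, not as a bulk-loading tool.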

Upvotes: 1
