Larzix

Reputation: 55

Cassandra - one big table vs many tables

I'm currently trying out Cassandra.
I'm using DataStax DevCenter and the DataStax C# driver.

My current model is quite simple: I will have exactly 1000 parameters (no more, no less), numbered 1 to 1000, and I will get an entry for each parameter once per second. This will run for years.

My question is whether it is better practice to create one table like this:

CREATE TABLE keyspace.measurement (
    parameterId int,
    value bigint,
    measureTime timestamp,
    PRIMARY KEY(parameterId, measureTime)
) WITH CLUSTERING ORDER BY (measureTime DESC)

Or would it be better to create 1000 tables, each consisting only of a value and a measureTime, and if so, would I still be able to range query on measureTime?
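
For example, the kind of range query I have in mind would look like this (the parameter id and time bounds are just placeholder values):

SELECT value, measureTime
FROM keyspace.measurement
WHERE parameterId = 42
  AND measureTime >= '2016-01-19 00:00:00'
  AND measureTime < '2016-01-20 00:00:00';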

Upvotes: 3

Views: 2127

Answers (2)

xmas79

Reputation: 5180

You are going to end up with very wide partitions with this. I would advise against your table format, and I'd go with something that allows you to control how wide the partitions grow.

Depending on your query requirements, here is a more suitable schema (IMHO):

CREATE TABLE keyspace.measurement (
    parameterId int,
    granularity timestamp,
    value bigint,
    measureTime timestamp,
    PRIMARY KEY((parameterId, granularity), measureTime)
) WITH CLUSTERING ORDER BY (measureTime DESC)

This is very similar to yours; however, it has a major advantage: you can configure how wide your partitions get, and you don't get any hotspots. The idea is dead simple: the parameterId and granularity fields together form the partition key, so they determine where your data will go, while measureTime keeps your data ordered. Supposing you want to query on a day-by-day basis, you'd store into granularity the yyyy-mm-dd value of your measureTime, grouping together all the measures of the same day.
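
To make the bucketing concrete, here is a minimal sketch of an insert (the parameter id and values are placeholders; the client is responsible for truncating measureTime down to the day to produce the granularity value):

INSERT INTO keyspace.measurement (parameterId, granularity, value, measureTime)
VALUES (42, '2016-01-19', 1234, '2016-01-19 13:37:00');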

This allows you to retrieve all the values lying in the same partition (that is, for a given parameterId and granularity pair) with an efficient range query. In a day-by-day configuration, you'd end up with 86400 records per partition. This number could still be high (the suggested limit is 10k IIRC), and you can lower that value by grouping hour-by-hour with a yyyy-mm-dd HH:00 value instead.
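
Such a range query would look like this (again a sketch with placeholder values, fetching one morning's worth of data from a single partition):

SELECT value, measureTime
FROM keyspace.measurement
WHERE parameterId = 42
  AND granularity = '2016-01-19'
  AND measureTime >= '2016-01-19 06:00:00'
  AND measureTime < '2016-01-19 12:00:00';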

The drawback of that approach is that if you need data from multiple partitions (e.g. you are grouping on a day-by-day basis, but you need data for two consecutive days, say the last 6 hours of Jan 19th and the first 6 hours of Jan 20th), then you'll need to perform multiple queries.
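
For instance (a sketch, with placeholder values), fetching the last 6 hours of Jan 19th and the first 6 hours of Jan 20th becomes two queries, one per partition:

-- last 6 hours of Jan 19th
SELECT value, measureTime FROM keyspace.measurement
WHERE parameterId = 42
  AND granularity = '2016-01-19'
  AND measureTime >= '2016-01-19 18:00:00';

-- first 6 hours of Jan 20th
SELECT value, measureTime FROM keyspace.measurement
WHERE parameterId = 42
  AND granularity = '2016-01-20'
  AND measureTime < '2016-01-20 06:00:00';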

Upvotes: 8

Clement Amarnath

Reputation: 5476

We have two methods here, and each has its own pros and cons.

Method 1: Create one table per parameter (1000 tables, each consisting only of a value and measureTime)

This approach would be good if we have only a limited number of parameters; if we need to accommodate more parameters in the future, creating one table per parameter would become cumbersome. Performance can be improved by spreading the tables across different shards.
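
A minimal sketch of what one such table could look like (the _0001 suffix and the day column are my assumptions: with PRIMARY KEY (measureTime) alone, measureTime would be the partition key, and Cassandra does not allow range queries on a partition key, so a bucket column is still needed to keep range queries on measureTime possible):

CREATE TABLE keyspace.measurement_0001 (
    day text,
    measureTime timestamp,
    value bigint,
    PRIMARY KEY (day, measureTime)
) WITH CLUSTERING ORDER BY (measureTime DESC)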

Method 2: Creating one big table

NoSQL databases are designed to perform well with a high number of records; even billions of records can still give good performance.

Considering that you will be getting an entry for each parameter once per second and that this will be running for years, I feel that Method 1 would be the best fit for your scenario, provided that there won't be an increase in the number of parameters in the future.

Upvotes: 0
