Larzix

Reputation: 55

Cassandra - one big table vs many tables

I'm currently trying out Cassandra.
I'm using DataStax DevCenter and the DataStax C# driver.

My current model is quite simple: I will have exactly 1000 parameters (no more, no less), numbered 1 to 1000, and I will get an entry for each parameter once per second. This will run for years.

My question is whether it is better practice to create one table like this:

CREATE TABLE keyspace.measurement (
    parameterId int,
    value bigint,
    measureTime timestamp,
    PRIMARY KEY(parameterId, measureTime)
) WITH CLUSTERING ORDER BY (measureTime DESC)

Or would it be better to create 1000 tables, each consisting only of a value and a measureTime, and if so, would I still be able to range query on measureTime?
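
For example, the kind of range query I have in mind would look like this (the parameter id and time bounds are just placeholder values):

SELECT value, measureTime
FROM keyspace.measurement
WHERE parameterId = 42
  AND measureTime >= '2016-01-19 00:00:00'
  AND measureTime < '2016-01-20 00:00:00';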

Upvotes: 3

Views: 2127

Answers (2)

xmas79

Reputation: 5180

You are going to end up with very wide partitions with this. I would advise against your table format, and I'd go with something that allows you to control how wide the partitions grow.

Depending on your query requirements, here is a more suitable schema (IMHO):

CREATE TABLE keyspace.measurement (
    parameterId int,
    granularity timestamp,
    value bigint,
    measureTime timestamp,
    PRIMARY KEY((parameterId, granularity), measureTime)
) WITH CLUSTERING ORDER BY (measureTime DESC)

This is very similar to yours; however, it has a major advantage: you can configure how wide your partitions get, and you don't get any hotspots. The idea is dead simple: the parameterId and granularity fields together form the partition key, so they determine where your data will go, while measureTime keeps your data ordered. Supposing you want to query on a day-by-day basis, you'd store into granularity the yyyy-mm-dd value of your measureTime, grouping together all the measures of the same day.
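
To make the bucketing concrete, here is a minimal sketch of an insert (the parameter id and values are placeholders; the client is responsible for truncating measureTime down to the day to produce the granularity value):

INSERT INTO keyspace.measurement (parameterId, granularity, value, measureTime)
VALUES (42, '2016-01-19', 1234, '2016-01-19 13:37:00');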

This allows you to retrieve all the values lying in the same partition (that is, for a given parameterId and granularity pair) with an efficient range query. In a day-by-day configuration, you'd end up with 86400 records per partition. This number could still be high (the suggested limit is 10k IIRC), and you can lower that value by grouping hour-by-hour with a yyyy-mm-dd HH:00 value instead.
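
Such a range query would look like this (again a sketch with placeholder values, fetching one morning's worth of data from a single partition):

SELECT value, measureTime
FROM keyspace.measurement
WHERE parameterId = 42
  AND granularity = '2016-01-19'
  AND measureTime >= '2016-01-19 06:00:00'
  AND measureTime < '2016-01-19 12:00:00';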

The drawback of that approach is that if you need data from multiple partitions (e.g. you are grouping on a day-by-day basis, but you need data for two consecutive days, say the last 6 hours of Jan 19th and the first 6 hours of Jan 20th), then you'll need to perform multiple queries.
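
For instance (a sketch, with placeholder values), fetching the last 6 hours of Jan 19th and the first 6 hours of Jan 20th becomes two queries, one per partition:

-- last 6 hours of Jan 19th
SELECT value, measureTime FROM keyspace.measurement
WHERE parameterId = 42
  AND granularity = '2016-01-19'
  AND measureTime >= '2016-01-19 18:00:00';

-- first 6 hours of Jan 20th
SELECT value, measureTime FROM keyspace.measurement
WHERE parameterId = 42
  AND granularity = '2016-01-20'
  AND measureTime < '2016-01-20 06:00:00';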

Upvotes: 8

Clement Amarnath

Reputation: 5476

We have two methods here, and each has its own pros and cons.

Method 1: Create one table per parameter (1000 tables, each consisting only of a value and measureTime)

This approach would be good if we have only a limited number of parameters; if we need to accommodate more parameters in the future, creating one table per parameter would become cumbersome. Performance can be improved by spreading the tables across different shards.
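
A minimal sketch of what one such table could look like (the _0001 suffix and the day column are my assumptions: with PRIMARY KEY (measureTime) alone, measureTime would be the partition key, and Cassandra does not allow range queries on a partition key, so a bucket column is still needed to keep range queries on measureTime possible):

CREATE TABLE keyspace.measurement_0001 (
    day text,
    measureTime timestamp,
    value bigint,
    PRIMARY KEY (day, measureTime)
) WITH CLUSTERING ORDER BY (measureTime DESC)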

Method 2: Creating one big table

NoSQL databases are designed to perform well with a high number of records; even billions of records can still give good performance.

Considering that you will be getting an entry for each parameter once per second and that this will be running for years, I feel that Method 1 would be the best fit for your scenario, provided that there won't be an increase in the number of parameters in the future.

Upvotes: 0
