Reputation: 3
I am trying to design a timeseries service based on Cassandra that will keep track of some log information. The database will see a relatively high volume of writes (expecting ~500mil inserts / day) and less frequent but large-volume reads (think one day of data or one month of data).
The simplified data model of one log entry looks like this (in reality it has 50 or so columns):
log_datetime date
log_some_field text
log_some_other_field text
Most read queries will select data from a certain date range, always ordered descending by date, e.g.:
SELECT * FROM logs
WHERE log_datetime >= '2012-01-01'
  AND log_datetime <= '2012-02-01'
ORDER BY log_datetime DESC;
These queries will normally take a considerable amount of time, so I'd like to optimize for them as much as possible.
Since ordering and filtering by date are the most important features, and writes just need to be not too terrible, my first idea was to define something like this (where log_day is the day of the year):
CREATE TABLE logs (
    log_day smallint,
    log_datetime timeuuid,
    log_some_field text,
    log_some_other_field text,
    PRIMARY KEY (log_day, log_datetime)
) WITH CLUSTERING ORDER BY (log_datetime DESC);
It is my understanding that this would make retrieval as good as it gets: the data is stored in clustering order, and a single partition holds one full day (I can handle in the client the cases where several days are selected). However, all writes for a given day would land on a single partition, and therefore on the same replica set, which would considerably affect write performance. The other option is choosing some fixed set of synthetic partition keys and distributing writes across them in a round-robin manner from the client, which would make writes faster and scalable but would lead to worse read performance, especially if we have to re-sort the data. Most examples I've seen have natural partition keys in the dataset, like a user_id or a post_id, which is not my case.
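The "worse read performance" of the round-robin option is smaller than a full re-sort suggests: since each partition already returns its rows in descending clustering order, the client only needs a k-way merge. A minimal sketch with Python's heapq.merge (the partition data here is illustrative, standing in for rows fetched from Cassandra):

```python
import heapq
from operator import itemgetter

# Hypothetical result sets from two partitions: (timestamp, payload) rows,
# each already sorted descending, as Cassandra returns them when the table
# clusters on log_datetime DESC.
partition_a = [(50, "a1"), (30, "a2"), (10, "a3")]
partition_b = [(40, "b1"), (20, "b2")]

def merge_descending(*partitions):
    """Merge several descending-sorted row streams without a full re-sort.

    heapq.merge keeps only one head element per stream in memory and runs
    in O(n log k) for k streams, versus O(n log n) for sorting from scratch.
    """
    return list(heapq.merge(*partitions, key=itemgetter(0), reverse=True))

merged = merge_descending(partition_a, partition_b)
# merged preserves the global descending order: 50, 40, 30, 20, 10
```

The same approach works with the Python driver's paged result sets, since each one is an iterable that yields rows in clustering order.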
Did anybody here have a similar usecase? If so, what tradeoffs did you perform to get decent performance? Do you know of any databases that would perform better in such usecases?
Upvotes: 0
Views: 138
Reputation: 1381
As you note, using day as a partition key means writes go to a single primary node for an entire day. Data is replicated in Cassandra based upon replication factor, typically 3. Thus, three nodes would be written to on any given day.
If the data volume were low, this might be acceptable. Generally it is not, and one would use some sort of time bucket, such as a 5- or 10-minute interval computed in the application.
CREATE TABLE logs (
    log_day smallint,
    timebucket smallint,
    log_datetime timeuuid,
    log_some_field text,
    log_some_other_field text,
    PRIMARY KEY ((log_day, timebucket), log_datetime)
) WITH CLUSTERING ORDER BY (log_datetime DESC);
The choice of an appropriate time interval for the bucket depends on your expected data volume. With 500M writes per day, that is roughly 6K per second. Your time buckets could wrap on the hour, giving only 6 values (using 10 minutes), or span the entire day, giving 144 unique buckets. When reading results, your application will have to read all buckets for a given day and merge (but not re-sort) the results.
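The bucket computation described above can be sketched in the application like this (a minimal example assuming day-of-year buckets spanning the whole day; BUCKET_MINUTES and the function names are illustrative):

```python
from datetime import datetime, timezone

BUCKET_MINUTES = 10  # assumed interval; tune to your write volume

def partition_key(ts: datetime):
    """Compute the (log_day, timebucket) composite partition key for a row.

    log_day is the day of the year (1-366); timebucket is the index of the
    10-minute interval within that day (0-143 for a 10-minute bucket).
    """
    day = ts.timetuple().tm_yday
    bucket = (ts.hour * 60 + ts.minute) // BUCKET_MINUTES
    return day, bucket

def buckets_for_day(day: int):
    """All partition keys a reader must query to cover one full day."""
    return [(day, b) for b in range(24 * 60 // BUCKET_MINUTES)]

# 2012-01-15 13:25 falls in day 15, bucket 80 (the 13:20-13:30 slot)
ts = datetime(2012, 1, 15, 13, 25, tzinfo=timezone.utc)
key = partition_key(ts)
```

The reader then issues one query per bucket (these can run concurrently) and merges the per-partition streams, which arrive already sorted descending.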
In a syslog-type application, using severity plus day in the partition key could help distribute the load across the cluster with a natural key. It would still be lumpy, because the count of info messages is a lot greater than that of warning, error, or fatal messages.
Upvotes: 1