How to model time series data in Cassandra when data has a non-uniform generation rate?

Question

I am planning to migrate data from my existing database (Postgres) to Cassandra. Here is a brief overview of the system:

The current data set size is around 2 billion data points.
Each data point represents an event. Its properties are user_id, event_name, and timestamp.
The data comes from a finite set of sources. For simplicity, assume 3 different sources, S1, S2, and S3, all pushing to a Kafka topic. The Cassandra microservice consumes data from this topic.
The rate of data from S1, S2, and S3 differs. Assume S1 pushes 1 event per user every minute, S2 pushes 1 event per user every 15 minutes, and S3 pushes 1 event per user every hour.
There are two types of queries this system should support:
- Get the latest event for a given user
- Get the list of events for a given user and date range (the range can span at most 30 days)
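
To make the skew between sources concrete, here is a small back-of-the-envelope sketch. It assumes the stated rates hold per user and that a month is 30 days; the names `INTERVALS` and `events_per_user` are made up for illustration.

```python
from datetime import timedelta

# Assumed per-user interval between events, taken from the rates stated above.
INTERVALS = {
    "S1": timedelta(minutes=1),
    "S2": timedelta(minutes=15),
    "S3": timedelta(hours=1),
}

def events_per_user(window: timedelta) -> dict[str, int]:
    """Number of events each source emits for one user within `window`."""
    return {source: window // interval for source, interval in INTERVALS.items()}

monthly = events_per_user(timedelta(days=30))
print(monthly)  # {'S1': 43200, 'S2': 2880, 'S3': 720}
```

Under these assumptions, a user fed by S1 produces 60 times as many rows per month as a user fed by S3, which is where the uneven partition sizes come from.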

I am trying to model this data using a few different approaches.

Partition the data for a single user into monthly buckets. For this, the additional parameters timestamp_year and timestamp_month are added. timestamp is used as the clustering key.
- Pros: Write latency is under 10 ms. Max partition size is around 60 MB (working well on Cassandra 3.11). Getting the latest event takes less than 10 ms (99.999th percentile).
- Cons: Fetching a month of data is slow because too much data is read from a single partition. If I put a limit on the number of records fetched (say 10,000), the latency improves. Partition size is non-uniform because of the different data rates from the 3 sources.
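
For reference, the monthly-bucket layout can be sketched roughly as below. The table name, the column types, and the descending clustering order are assumptions (the order is chosen so that the latest event is the first row of the newest partition); `month_buckets` is a hypothetical helper that lists the partitions a date-range query has to read.

```python
from datetime import date

# Hypothetical CQL for the monthly-bucket table; names and types are assumed.
SCHEMA = """
CREATE TABLE events_by_user_month (
    user_id text,
    timestamp_year int,
    timestamp_month int,
    timestamp timestamp,
    event_name text,
    PRIMARY KEY ((user_id, timestamp_year, timestamp_month), timestamp)
) WITH CLUSTERING ORDER BY (timestamp DESC);
"""

def month_buckets(start: date, end: date) -> list[tuple[int, int]]:
    """(year, month) partitions touched by the inclusive range [start, end]."""
    buckets = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        buckets.append((year, month))
        # Roll over to January of the next year after December.
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return buckets

print(month_buckets(date(2023, 12, 20), date(2024, 1, 18)))
# [(2023, 12), (2024, 1)]
```

A 30-day range usually touches two monthly partitions, and can touch three when it spans a short February (for example 2023-01-31 to 2023-03-02).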

I have tried weekly buckets instead of monthly buckets, plus pagination, to improve the other parameters. But one thing I have not been able to sort out: partition size is still non-uniform because of the different data rates from the 3 sources.
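
The weekly variant can be sketched the same way. This assumes buckets keyed by ISO year and ISO week, which is only one possible convention; `week_buckets` is a hypothetical helper, not part of any driver.

```python
from datetime import date, timedelta

def week_buckets(start: date, end: date) -> list[tuple[int, int]]:
    """(ISO year, ISO week) partitions touched by the inclusive range [start, end]."""
    buckets = []
    # Start from the Monday of the first week and step one week at a time.
    day = start - timedelta(days=start.weekday())
    while day <= end:
        iso = day.isocalendar()
        buckets.append((iso[0], iso[1]))
        day += timedelta(days=7)
    return buckets

buckets = week_buckets(date(2024, 1, 1), date(2024, 1, 30))
print(len(buckets))  # 5
```

Weekly buckets make every partition smaller, but a 30-day query now fans out to about five partitions, and the relative size difference between sources stays the same because the bucket width is identical for all of them.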

How can I keep partition size (almost) consistent in such a data model? Ideas are welcome.
