Cassandra slow reads (by partition key) for large data rows fetched

Question

I have a 5 node cassandra (3.11) cluster on AWS. Machine specs

Model   vCPU    Memory (GiB)    Instance Storage (GB)   Networking Bandwidth (Gbps)
i3.large    2   15.25   1 x 475 NVMe SSD    Up to 10

For my keyspace the replication factor = 2.

I need to store the movement records for vehicles in a table.

The columns are pretty straight-forward - vehicleId, lat, lng, timestamp

We have 100K vehicles and have historical 10 years of data of each.

The total volume of data is around 2 billion rows.

I need to read all the data for the given vehicle in ASC order to do some processing on it.

Earlier our table structure was

CREATE TABLE vehicle_movement (
vehicle_id int,
timestamp timestamp,
lat double,
lng double,
PRIMARY KEY (vehicle_id, timestamp)
) WITH CLUSTERING ORDER BY (timestamp ASC)

Where vehicle_id was PARTITION KEY. 
AND the query was

SELECT lat, lng, timestamp from vehicle_movement where vehicle_id = xyz

The performance of read query was quite slow (~5 sec). We found that since each vehicle_id alone as partition key wont suffice as we have 300K+ records per vehicle and this may result in large partitions.

So we changed our schema a bit

Following is my updated table

    CREATE TABLE vehicle_movement (
    vehicle_id int,
    year int,
    timestamp timestamp,
    lat double,
    lng double,
    PRIMARY KEY ((vehicle_id, year), timestamp)
    ) WITH CLUSTERING ORDER BY (timestamp ASC)

    Here vehicle_id and year are composite partition keys 
which ensures that each partition may not contain more than 10k-15k records.

But since my problem statement remains the same - fetch all records for a given vehicle in ASC order, I further split the query into multiple

Eg
SELECT lat, lng, timestamp from vehicle_movement where vehicle_id = xyz and year IN (2018, 2019)


SELECT lat, lng, timestamp from vehicle_movement where vehicle_id = xyz and year IN (2020, 2021)

SELECT lat, lng, timestamp from vehicle_movement where vehicle_id = xyz and year IN (2022, 2023)

This approach reduced latency as compared to earlier but still it's not satisfactory (2 to 2.5 sec as compared to earlier 5 sec).

My client is spring based java app wherein I use spring-data-cassandra.

I execute these multiple queries and accumulate all the results.

For my reads, I have used consistency = LOCAL_ONE (which is fastest for reads)

I tried serial as well as parallel execution of the queries but still overall latency is more of less the same.

So my queries are:

What can be done to improve the latency further - I have to read all the data of given vehicle order by timestamp ASC to do processing?
Is cassandra even the right choice given I have high throughput reads?
I even gave SCYLLA DB a go as it is a drop-in replacement for cassandra with better benchmarks. SCYLLA gave me improved latency by around 30%

EDIT 2

I de-normalized my schema a bit and it have improved my read latencies significantly.

CREATE TYPE movement_point (
     lat DOUBLE,
     lng DOUBLE,
     ts TIMESTAMP
 );
 
  CREATE TABLE vehicle_movement_denorm (
      vehicle_id INT,
      year INT,
      points LIST>, // stored in ASC order
      count INT, // As there is no built in function for count
      PRIMARY KEY (vehicle_id, year)
  ) WITH CLUSTERING ORDER BY (year ASC);

Fortunately for each vehicle and year, we have 7000-8000 records, which are fitting nicely into frozen list whilst maintaining order.

While reading data, Im querying for 3-4 years in a single call. So 2-3 calls to fetch my data.

SELECT points from vehicle_movement_denorm where vehicle_id = xyz and year in (2020, 2021, 2022) 

SELECT points from vehicle_movement_denorm where vehicle_id = xyz and year in (2023, 2024)

This approach is giving me read latencies of 200-400 millis which is quite good. Is this denormalization approach good and scalable?

Cassandra slow reads (by partition key) for large data rows fetched

Answers (1)

Related Questions