Alex Tbk
Alex Tbk

Reputation: 2104

Handling of multiple queries as one result

Lets say I have this table

CREATE TABLE device_data_by_year (
    year int,
    device_id uuid,
    sensor_id uuid,
    nano_since_epoch bigint,
    unit text,
    value double,
    source text,
    username text,
    PRIMARY KEY (year, device_id, nano_since_epoch,sensor_id)
) WITH CLUSTERING ORDER BY (device_id desc, nano_since_epoch desc);

I need to query data for a particular device and sensor in a period between 2017 and 2018. In this case 2 queries will be issued:

select * from device_data_by_year where year = 2018 AND device_id = ? AND sensor_id = ? AND nano_since_epoch >= ? AND nano_since_epoch <= ?

select * from device_data_by_year where year = 2018 AND device_id = ? AND sensor_id = ? AND nano_since_epoch >= ? AND nano_since_epoch <= ?

Currently I iterate over the resultsets and build a List with all the results. I am aware that this could (and will) run into OOM problems some day. Is there a better approach, how to handle / merge query results into one set?

Thanks

Upvotes: 2

Views: 703

Answers (2)

Alex Tbk
Alex Tbk

Reputation: 2104

Changed the table structure to:

CREATE TABLE device_data (
   week_first_day timestamp,
   device_id uuid,
   sensor_id uuid,
   nano_since_epoch bigint,
   unit text,
   value double,
   source text,
   username text,
   PRIMARY KEY ((week_first_day, device_id), nano_since_epoch, sensor_id)
) WITH CLUSTERING ORDER BY (nano_since_epoch desc, sensor_id desc);

according to @AlexOtt proposal. Some changes to the application logic are required - for example findAllByYear needs to iterate over single weeks now.

Coming back to the original question: would you rather send 52 queries (getDataByYear, one query per week) oder would you use the IN operator here?

Upvotes: 0

Alex Ott
Alex Ott

Reputation: 87174

You can use IN to specify a list of years, but this is not very optimal solution - because the year field is partition key, then most probably the data will be on different machines, so one of the node will work as "coordinator", and will need to ask another machine for results, and aggregate data. From performance point of view, 2 async requests issued in parallel could be faster, and then do the merge on client side.

P.S. your data model have quite serious problems - you partition by year, this means:

  • Data isn't very good distributed across the cluster - only N=RF machines will hold the data;
  • These partitions will be very huge, even if you get only hundred of devices, reporting one measurement per minute;
  • Only one partition will be "hot" - it will receive all data during the year, and other partitions won't be used very often.

You can use months, or even days as partition key to decrease the size of partition, but it still won't solve the problem of the "hot" partitions.

If I remember correctly, Data Modelling course at DataStax Academy has an example of data model for sensor network.

Upvotes: 2

Related Questions