Coder

Reputation: 3262

Nested query in cassandra

I am new to Cassandra. I tried to insert employees, each with a list of laptop mappings, where 'laptoplist' is a list of a UDT:

   cqlsh:sourceutilization> SELECT * from employee ;

 id | laptoplist                                                                        | name      | type
----+-----------------------------------------------------------------------------------+-----------+------------
  5 |  [{laptopid: 5, cpu: 9, memory: 18, networkutilization: 25, diskutilization: 85}] | testname5 | staffType5
  1 | [{laptopid: 1, cpu: 94, memory: 36, networkutilization: 13, diskutilization: 66}] | testname1 | staffType1
  8 |  [{laptopid: 8, cpu: 64, memory: 1, networkutilization: 15, diskutilization: 71}] | testname8 | staffType8
  0 |  [{laptopid: 0, cpu: 4, memory: 95, networkutilization: 20, diskutilization: 16}] | testname0 | staffType0
  2 | [{laptopid: 2, cpu: 49, memory: 37, networkutilization: 20, diskutilization: 88}] | testname2 | staffType2
  4 | [{laptopid: 4, cpu: 13, memory: 67, networkutilization: 67, diskutilization: 10}] | testname4 | staffType4
  7 | [{laptopid: 7, cpu: 11, memory: 75, networkutilization: 75, diskutilization: 97}] | testname7 | staffType7
  6 |  [{laptopid: 6, cpu: 27, memory: 34, networkutilization: 2, diskutilization: 92}] | testname6 | staffType6
  9 | [{laptopid: 9, cpu: 12, memory: 10, networkutilization: 19, diskutilization: 73}] | testname9 | staffType9
  3 | [{laptopid: 3, cpu: 47, memory: 13, networkutilization: 72, diskutilization: 54}] | testname3 | staffType3
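For context, a schema along the following lines would produce rows like those above. The exact definitions are not shown in the question, so the type and column definitions here are assumptions reconstructed from the output:

    CREATE TYPE laptop (
        laptopid int,
        cpu int,
        memory int,
        networkutilization int,
        diskutilization int
    );

    CREATE TABLE employee (
        id int PRIMARY KEY,
        name text,
        type text,
        laptoplist list<frozen<laptop>>
    );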

Now I want to run a query like the one below. How is this possible?

select * from employee where laptoplist.networkutilization > 50;

FYI, I am using Cassandra version 3.1.

Thanks in advance, Harry

Upvotes: 2

Views: 4708

Answers (2)

sachin

Reputation: 379

You cannot do a range query on just any column; Cassandra places restrictions on this.

Before creating a schema in Cassandra, you must be specific about how you want to query the data, otherwise most of the time your schema won't work.

To do a range query (greater than, greater than or equal, less than, less than or equal) on a column, that column must be declared as a clustering column in the schema.

You cannot specify only a clustering column; every Cassandra table must also declare a partition key.

To query on a clustering column, you must also supply the values of all preceding primary key columns in the query.
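As a minimal sketch of these rules (the table and column names here are made up for illustration), a range query is valid only on a clustering column, and only with the partition key fixed:

    CREATE TABLE readings (
        sensor_id int,
        reading int,
        checkpoint timestamp,
        PRIMARY KEY ((sensor_id), reading, checkpoint)
    );

    -- valid: partition key fixed, range on the first clustering column
    SELECT * FROM readings WHERE sensor_id = 1 AND reading > 50;

    -- rejected without ALLOW FILTERING: no partition key restriction
    SELECT * FROM readings WHERE reading > 50;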

Upvotes: 0

Aaron

Reputation: 57748

This isn't going to work well, as-is. There are more than a few changes needed to get what you want here. In Cassandra there are two things that can usually help.

  1. If you're having trouble with a data model, ask yourself what it might look like as a time series.

With Cassandra's distributed, append-only storage engine, use cases like time series and event tracking fit easily. And sometimes a data model makes more sense (from a Cassandra point of view) when adjusted to that perspective.

  2. Build your tables to match your query patterns.

I see what is probably a PRIMARY KEY on ID. But what I don't see (above, at least) are any queries that filter on ID. I can tell that things like employees and laptops are important, and probably unique. But unique keys don't always make the best information filters.

The main question to ask is, what are you trying to get here?

To me, it looks like you want to see users that are experiencing high network utilization. And high network utilization is a (hopefully) temporary thing, so why don't we add a time component to it (checkpoint_time)? IMO, it makes sense to track computing resource utilization over time. After considering those points, I came up with a data model like this:

cassdba@cqlsh:stackoverflow> CREATE TABLE employee_laptop__by_network_utilization (
                       timebucket text,
                       checkpoint_time timestamp,
                       employee_id bigint,
                       name text,
                       type text,
                       laptop_id bigint,
                       cpu bigint,
                       memory bigint,
                       network_utilization bigint,
                       disk_utilization bigint,
                       PRIMARY KEY ((timebucket),network_utilization,
                           checkpoint_time,employee_id,laptop_id)
                     ) WITH CLUSTERING ORDER by 
                          (network_utilization ASC, checkpoint_time DESC,
                           employee_id ASC, laptop_id ASC);

After inserting some rows, I can now query employee/laptop combinations that were experiencing network utilization > 50 on October 12, 2017.

cassdba@cqlsh:stackoverflow> SELECT * FROM employee_laptop__by_network_utilization
     WHERE timebucket='20171012' AND network_utilization > 50;

 timebucket | network_utilization | checkpoint_time                 | employee_id | laptop_id | cpu | disk_utilization | memory | name     | type
------------+---------------------+---------------------------------+-------------+-----------+-----+------------------+--------+----------+-----------
   20171012 |                  55 | 2017-10-12 12:30:00.000000+0000 |           1 |         1 |   4 |               62 |     19 | Jebediah |     Pilot
   20171012 |                  55 | 2017-10-12 12:15:00.000000+0000 |           1 |         1 |  19 |               62 |     18 | Jebediah |     Pilot
   20171012 |                  72 | 2017-10-12 12:00:00.000000+0000 |           3 |         3 |  47 |               54 |     13 |      Bob | Scientist

(3 rows)

First of all, I needed a good partition key that would both make sense for the queries and prevent my partitions from unbounded growth. Therefore, I picked a "date bucket" column named timebucket. This way, I can isolate each query to a single day and ensure that it is served by a single node.

Next, I clustered on network_utilization, as that's the column this model is primarily concerned with. It is the first clustering column so that we don't have to provide too many additional filtering columns in our queries.

checkpoint_time is the next column in the PRIMARY KEY, mainly because rows with the same timebucket and network_utilization will probably make more sense sorted by time (DESCending).

Finally, I added employee_id for uniqueness, and then laptop_id because an employee could have more than one laptop.
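For illustration, the rows in the sample result above could have been written with inserts like this (values taken from the query output; timestamps assumed to be UTC):

    INSERT INTO employee_laptop__by_network_utilization
        (timebucket, network_utilization, checkpoint_time, employee_id,
         laptop_id, cpu, disk_utilization, memory, name, type)
    VALUES ('20171012', 72, '2017-10-12 12:00:00+0000', 3, 3, 47, 54, 13,
            'Bob', 'Scientist');

Note that each checkpoint produces a new row rather than overwriting the old one, which is what makes the time-series tracking work.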

Now, I'm sure you're going to find aspects of my solution that don't exactly fit with your use case. And that's because Cassandra data modeling is very use-case-centric. Often one good solution isn't a cookie-cutter-fit for another. But, it's one way to get the data that you're after.

Upvotes: 3
