Reputation: 145

Cassandra column family design

I'm having trouble designing a column family that suits the following requirement: I would like to update all rows that match a condition on a field that is neither the primary key nor unique.

For example, if a User column family has ID, name and birthday columns, I would like to update all the users who were born after a specific day.
Even if I add 'birthday' to the primary key (let's say ('ID', 'birthday')), I cannot perform this query because part of the primary key is missing.

How can I approach this by designing my column family differently?
Thanks.

Upvotes: 2

Answers (1)

shutty

Reputation: 3348

According to the Cassandra docs, there is no way to update rows without explicitly specifying their partition key. This is by design, not by accident: a statement such as update users set status=1 where id>10 could touch all the data in a table at once, which can be very expensive on large databases. Cassandra explicitly forbids updates that would require scanning data across multiple partitions, and rejects such SELECTs unless you add ALLOW FILTERING.

To update multiple users at once, you have to know their IDs. Given a table defined as:

CREATE TABLE stackoverflow.users (
    id timeuuid PRIMARY KEY,
    dob timestamp,
    status text
)

and knowing the users' primary keys, you can run queries like update users set status='foo' where id in (1,2,3,4) (the integers here stand in for real timeuuid values). However, queries with a very large set of keys inside the IN clause may cause performance issues in C*.
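One way to keep each IN clause small is to split the known IDs into fixed-size chunks and issue one UPDATE per chunk. The following is a minimal Python sketch that only builds the statements and their parameters; the helper names chunked and build_updates and the default chunk size of 100 are assumptions made for illustration, not values taken from the Cassandra docs. The ? placeholders are meant for prepared statements.

```python
from typing import Iterator, List, Sequence, Tuple
from uuid import UUID, uuid1


def chunked(items: Sequence[UUID], size: int) -> Iterator[Sequence[UUID]]:
    """Yield consecutive slices of `items`, each at most `size` long."""
    if size < 1:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_updates(ids: Sequence[UUID], status: str,
                  size: int = 100) -> List[Tuple[str, list]]:
    """Return (statement, parameters) pairs, one UPDATE per chunk of ids.

    Hypothetical helper: the chunk size is an arbitrary illustrative default.
    """
    updates = []
    for chunk in chunked(ids, size):
        # One "?" placeholder per id in this chunk.
        placeholders = ", ".join("?" for _ in chunk)
        cql = f"UPDATE users SET status = ? WHERE id IN ({placeholders})"
        updates.append((cql, [status, *chunk]))
    return updates


if __name__ == "__main__":
    # uuid1() produces time-based UUIDs, matching the timeuuid column type.
    user_ids = [uuid1() for _ in range(5)]
    for cql, params in build_updates(user_ids, "foo", size=2):
        print(cql, len(params))
```

Each pair can then be prepared and executed with whatever driver you use; a suitable chunk size depends on your cluster and should be measured rather than assumed.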

But how can you run an efficient range query like select id from some_table where dob>'2000-01-01 00:00:01'? There are two options available, and neither of them is really satisfactory:

Create an index table like CREATE TABLE stackoverflow.dob_index ( year int, dob timestamp, ids list<timeuuid>, PRIMARY KEY (year, dob) ), with a compound primary key (partition key year, clustering column dob), and use multiple queries like select * from dob_index where year=2014 and dob<'2014-05-01 00:00:01'; to fetch IDs for different years. Notice that I've split the table into one partition per year so that the data is distributed reasonably evenly across the cluster. The general idea is that you really shouldn't have a small number of very large partitions; prefer a large number of small ones if there's a choice.
Maintain a separate stand-alone index for complex queries (such as Elasticsearch, Solr or Sphinx).
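To illustrate the first option, a "born after" lookup against dob_index turns into one query per year partition. Here is a rough Python sketch that builds those queries; the function name dob_after_queries and the last_year parameter are hypothetical, and it assumes the year column always holds the calendar year of dob.

```python
from datetime import datetime
from typing import List, Tuple


def dob_after_queries(cutoff: datetime, last_year: int) -> List[Tuple[str, list]]:
    """Build one SELECT per year partition for rows with dob > cutoff.

    Assumes `year` in dob_index is the calendar year of `dob`.
    `last_year` is the most recent year partition worth querying.
    """
    queries = []
    for year in range(cutoff.year, last_year + 1):
        if year == cutoff.year:
            # Only the cutoff's own year needs the clustering-column range filter.
            queries.append(
                ("SELECT ids FROM dob_index WHERE year = ? AND dob > ?",
                 [year, cutoff])
            )
        else:
            # Every row in a later year qualifies, so read the whole partition.
            queries.append(("SELECT ids FROM dob_index WHERE year = ?", [year]))
    return queries


if __name__ == "__main__":
    for cql, params in dob_after_queries(datetime(2000, 1, 1, 0, 0, 1), 2002):
        print(cql, params)
```

Every query names exactly one partition key, so no cross-partition scan is needed; the price is one round trip per year, plus the work of keeping dob_index in sync with the users table.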

But I suggest revisiting your application logic so that you avoid updating or deleting data at all:

Instead of updating the users table directly, you can have a separate table user_statuses into which you insert new statuses: CREATE TABLE user_statuses ( id timeuuid, updated_at timestamp, status text, PRIMARY KEY (id, updated_at) )
When you need to scan or update a lot of rows at once, prefer tools like Spark to distribute the workload efficiently across your cluster nodes.
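The append-only idea from the first point can be sketched as follows. The two CQL strings show how writes and reads against user_statuses might look, and the StatusLog class is a purely in-memory stand-in (a hypothetical name, not a driver API) that mimics the "latest row wins" read so the behaviour can be tried without a cluster.

```python
from datetime import datetime
from typing import Dict, Optional
from uuid import UUID, uuid1

# Statements one might prepare against the user_statuses table.
INSERT_STATUS = "INSERT INTO user_statuses (id, updated_at, status) VALUES (?, ?, ?)"
LATEST_STATUS = ("SELECT status FROM user_statuses WHERE id = ? "
                 "ORDER BY updated_at DESC LIMIT 1")


class StatusLog:
    """In-memory stand-in for user_statuses: statuses are added, never edited."""

    def __init__(self) -> None:
        # Partition key (id) -> clustering key (updated_at) -> status.
        self._rows: Dict[UUID, Dict[datetime, str]] = {}

    def insert(self, user_id: UUID, updated_at: datetime, status: str) -> None:
        # Same (id, updated_at) overwrites, like a Cassandra upsert.
        self._rows.setdefault(user_id, {})[updated_at] = status

    def latest(self, user_id: UUID) -> Optional[str]:
        rows = self._rows.get(user_id)
        if not rows:
            return None
        # Equivalent of ORDER BY updated_at DESC LIMIT 1.
        return rows[max(rows)]


if __name__ == "__main__":
    log = StatusLog()
    user = uuid1()
    log.insert(user, datetime(2014, 1, 1), "new")
    log.insert(user, datetime(2014, 5, 1), "active")
    print(log.latest(user))
```

The current status is simply the newest row in the user's partition, so a status change becomes a plain insert and the full history stays available.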

Upvotes: 1
