GantengX

Reputation: 1501

Best practice to update large number of rows in Cassandra reliably (relational update)

I have a few tables that are related to each other, it looks something like this:

organizations: 
- id
- name
- ... other fields

users:
- id
- name
- organization_id
- organization_name
- ... other fields

I keep the organization_name field in the users table so that reads don't have to look up the organizations table to get the organization name.

The problem is that if an organization's name changes, all users related to that organization must be updated to reflect the new name. In my real scenario there are more tables that store organization_name.

Problem: Currently I just fire the update statements asynchronously, and if the process fails halfway through I end up with inconsistent data.
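The inconsistency can be sketched in plain Python (the data layout and names are hypothetical; no Cassandra driver is involved):

```python
# Simulate the partial-failure problem: rename an organization, but
# "crash" after only some of the denormalized user rows are updated.
users = [
    {"id": i, "organization_id": 1, "organization_name": "Acme"}
    for i in range(10)
]

def rename_org_unsafely(users, org_id, new_name, crash_after):
    """Fire updates one by one; simulate a crash partway through."""
    count = 0
    for user in users:
        if user["organization_id"] != org_id:
            continue
        if count == crash_after:
            raise RuntimeError("worker died mid-update")
        user["organization_name"] = new_name
        count += 1

try:
    rename_org_unsafely(users, org_id=1, new_name="Acme Corp", crash_after=5)
except RuntimeError:
    pass  # in a real app the process just dies here

names = {u["organization_name"] for u in users}
print(names)  # both the old and the new name survive -> inconsistent data
```

After the crash, half the rows carry the old name and half the new one, which is exactly the state described above.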

Question: Is there a best practice for dealing with this sort of issue?

Possible solutions:

PS: my rows are not too wide; at most I have about 20 columns per table.


Update:

Forgot to add: this is a webapp where updates need to be reflected as soon as possible, so a batch job isn't applicable.


Update 2:

Regarding the read pattern: my current example is oversimplified, but in any case I need to fetch lists of users (possibly from multiple organizations). Such a query might return thousands of users across hundreds of organizations, which is why I stored organization_name in the users table; my understanding is that with Cassandra, denormalizing data is the way to go.

Upvotes: 2

Views: 1479

Answers (2)

nevsv

Reputation: 2466

Try to work with paging. Most drivers support it.

1) Fetch the rows to update from the users table, with a page size of x rows per page.

2) Run an async update for each record in the page.

3) Move to the next page.
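The steps above can be sketched in plain Python. With the DataStax Python driver you would set fetch_size on a statement and iterate the result set page by page; here the paging is simulated in memory so the loop logic stands on its own:

```python
# Page-by-page update loop (paging simulated in memory).
def pages(rows, page_size):
    """Yield successive pages of at most page_size rows."""
    for start in range(0, len(rows), page_size):
        yield rows[start:start + page_size]

def update_organization_name(users, org_id, new_name, page_size=100):
    """Update the denormalized name one page at a time."""
    matching = [u for u in users if u["organization_id"] == org_id]
    updated = 0
    for page in pages(matching, page_size):
        # 2) run the (async) update for each record in the page ...
        for user in page:
            user["organization_name"] = new_name
            updated += 1
        # 3) ... then move on to the next page
    return updated

users = [
    {"id": i, "organization_id": 1, "organization_name": "Acme"}
    for i in range(250)
]
print(update_organization_name(users, org_id=1, new_name="Acme Corp"))  # 250
```

Paging bounds how much data sits in memory at once, but note it does not by itself make the run crash-safe; for that, combine it with a bookmark as the other answer suggests.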

Upvotes: 2

xmas79

Reputation: 5180

Like in every long-running update process, you should use the concept of a bookmark:

  • Run a job of (say) 100 async updates, then record somewhere that you have just finished updating 100 rows.
  • Run another job of another 100 rows, then bookmark that you've now updated 200 rows.
  • And so on...

In the event of a crash, you simply resume where you left off by reading your bookmark.

To perform such a task you must already know which records have to be updated, but I'm assuming you either know them or know how to retrieve that information.
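A minimal sketch of the bookmark idea, assuming the set of row ids to update is known up front. The bookmark store here is a plain dict standing in for any durable storage (a Cassandra table, a file, ...):

```python
# Process ids in fixed-size jobs, persisting progress after each job,
# and resume from the bookmark after a crash.
bookmark_store = {"progress": 0}  # stand-in for durable storage

def run_update(ids, apply_update, job_size=100, fail_at=None):
    """Apply apply_update to each id, checkpointing after every job."""
    start = bookmark_store["progress"]  # resume where we left off
    for i in range(start, len(ids)):
        if fail_at is not None and i == fail_at:
            raise RuntimeError("crash mid-run")
        apply_update(ids[i])
        if (i + 1) % job_size == 0:
            bookmark_store["progress"] = i + 1  # "done 100", "done 200", ...
    bookmark_store["progress"] = len(ids)

ids = list(range(350))
done = []
try:
    run_update(ids, done.append, job_size=100, fail_at=250)
except RuntimeError:
    pass  # crashed after the 200-row checkpoint
run_update(ids, done.append, job_size=100)  # resumes from the bookmark
```

Because the second run resumes from the last checkpoint (200) rather than the exact crash point (250), rows in between are applied twice; that is harmless here, since re-setting organization_name to the same value is idempotent, which is exactly why this pattern fits Cassandra-style updates.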

Upvotes: 3
