Alex Pan
Alex Pan

Reputation: 4571

Cassandra: backfilling new boolean field

I added a new boolean column called subscribe to my email_subscriptions table in Cassandra. I noticed that it returns false for all rows' subscribe field.

I wanted to default all rows in the table with the subscribe field as true, but this StackOverflow answer says:

there is no default value in Cassandra.

So my question is, how do I set all rows in my email_subscriptions table to have their subscribe field set to true? Do I need to backfill via a batch update?

Upvotes: 1

Views: 1582

Answers (2)

xmas79
xmas79

Reputation: 5180

The only way is to fill back your whole table. Depending on size of your table, you could have problems at querying your table with a simple

SELECT * FROM mytable;

due to timeouts. Since the partition key is mandatory in an UPDATE statement, you have to find a way to spill your partition keys out from that table.

This is the perfect scenario for using the TOKEN function. Assuming you did your homework and don't have any too-wide partitions, you can scan all your dataset by splitting it into ranges of partitions. How wide is your range is up to your data. From a general point of view, you need to:

SELECT __partition_key_columns__ FROM mytable WHERE 
    TOKEN(__partition_key_columns__) >= min_range AND 
    TOKEN(__partition_key_columns__) < max_range;

and min_range and max_range go from -2^63 to 2^64-1 (IIRC, using Murmur3) in steps of a guessed window size W:

SELECT __partition_key_columns__ FROM mytable WHERE TOKEN(__partition_key_columns__) >= -2^63 AND TOKEN(__partition_key_columns__) < -2^63 + W;
SELECT __partition_key_columns__ FROM mytable WHERE TOKEN(__partition_key_columns__) >= -2^63 + W AND TOKEN(__partition_key_columns__) < -2^63 + 2*W;
...

until you covered all the range up to 2^64-1. If you get a timeout make W smaller and try again. And if you don't, expand your window W so you'll speed up the process. You will be able to extract all the partitions to issue the updates for each range.

EDIT: This blog post explains exactly how to perform such task.

Upvotes: 1

Ashraful Islam
Ashraful Islam

Reputation: 12830

There is two way you can do this.

  • You can create a program, which will select all record from email_subscriptions and insert back along with subscribe = true value.

    Or

  • When selecting subscribe field value, check the value is null with isNull() method (which will return true if the column is null, false otherwise). If true return, it means that subscribe is null and this value not yet inserted, you can treat it as subscribe = true

Upvotes: 4

Related Questions