Reputation: 48702
How do Cassandra's handling of updates and cluster keys interact?
It strikes me that these two features might interact badly, causing generation of excessive garbage.
Consider this schema:
CREATE TABLE t (
  p int,
  c int,
  d text,
  PRIMARY KEY ((p), c)
);
After execution of the following insertions:
INSERT INTO t (p, c, d) VALUES (1, 1, 'text-1');
INSERT INTO t (p, c, d) VALUES (1, 2, 'text-2');
is there a tombstone-marked record holding the (1, 1, "text-1") data and a new record holding both the (1, 1, "text-1") and (1, 2, "text-2") data? That is, has the second insert been implemented as an update of the "real" record that has a partition key (p) of 1?
Upvotes: 2
Views: 202
Reputation: 8985
Your assumption is incorrect. In your schema, p is the partition (or "row") key, and c is a clustering column. Cassandra is a wide-column store, so a write is essentially a collection of sparse, ordered columns attached to a partition. Additional nesting is achieved through composite row keys and column names, which in your case translates to a storage model that looks like this:
Row Key: 1 =>
1:d => "text-1"
2:d => "text-2"
If you were to insert another partition key, like this:
INSERT INTO t (p, c, d) VALUES (2, 1, 'text-1');
your storage model would look like this:
Row Key: 1 =>
1:d => "text-1"
2:d => "text-2"
Row Key: 2 =>
1:d => "text-1"
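The layout above can be mimicked with a small Python sketch. This is only a toy in-memory model, not Cassandra's actual storage engine: the `storage` dictionary and the `insert` helper are hypothetical names introduced for illustration. It shows that each insert writes its own cell, keyed by clustering value and column name, without rewriting the other cells in the partition.

```python
from collections import defaultdict

# Hypothetical toy model: partition key -> {(clustering key, column name): value}
storage = defaultdict(dict)

def insert(p, c, d):
    """Write one cell; other cells in the same partition are left untouched."""
    storage[p][(c, "d")] = d

insert(1, 1, "text-1")
insert(1, 2, "text-2")
insert(2, 1, "text-1")

# Print the model in the same shape as the listing above.
for p in sorted(storage):
    print(f"Row Key: {p} =>")
    for (c, col), value in sorted(storage[p].items()):
        print(f'{c}:{col} => "{value}"')
```

The second insert into partition 1 adds the `(2, "d")` entry and leaves `(1, "d")` as it was, so no tombstone is involved.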
So you can observe that these column values (1:d, 2:d, etc.) are treated independently. Suppose you then delete one of those values:
DELETE FROM t WHERE p = 1 AND c = 1;
your result would be:
Row Key: 1 =>
1:d => "text-1" + [tombstone]
2:d => "text-2"
Row Key: 2 =>
1:d => "text-1"
where the tombstone has a greater timestamp and therefore "covers" the original value until compaction cleans this up. Exactly when that happens depends on a number of factors (the value of gc_grace_seconds, the compaction strategy, the workload, etc.).
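As a rough illustration of how a tombstone covers an older value, here is a simplified Python sketch. The `Cell` class and the `read` and `compact` functions are hypothetical, timestamps are plain seconds, and the purge rule looks only at the age of the tombstone; real compaction has further conditions that this ignores.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed grace period: 864000 seconds, i.e. 10 days.
GC_GRACE_SECONDS = 864_000

@dataclass(frozen=True)
class Cell:
    timestamp: int          # write time in seconds (simplified)
    value: Optional[str]    # None marks a tombstone

def read(versions):
    """Return the value of the newest version, or None if it is a tombstone."""
    newest = max(versions, key=lambda cell: cell.timestamp)
    return newest.value

def compact(versions, now):
    """Keep only the newest version; drop it too if it is an expired tombstone."""
    newest = max(versions, key=lambda cell: cell.timestamp)
    expired = newest.value is None and now - newest.timestamp > GC_GRACE_SECONDS
    return [] if expired else [newest]

# The 1:d cell from the example: an insert followed by a later delete.
versions = [Cell(100, "text-1"), Cell(200, None)]
```

With these definitions, `read(versions)` yields no value because the tombstone is newer, `compact(versions, now=300)` keeps only the tombstone, and a compaction run after the grace period has elapsed returns an empty list.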
Upvotes: 2
Reputation: 1
It is my understanding that Cassandra does not delete records on insert/update (upsert); it simply records the new information as a write and does not create a tombstone. When the data is read, it uses the write timestamps to determine which version is the most recent. Old versions are removed during compaction, whereas tombstones live on until the grace period (gc_grace_seconds, default 10 days) expires, so that deleted data is not resurrected by a replica that missed the delete.
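A minimal sketch of that last-write-wins behaviour, under the assumption that each cell version is a `(timestamp, value)` pair and that two hypothetical data files (`sstable_old`, `sstable_new`) hold different versions of the same cell. The names are invented for illustration, and timestamp ties are not handled.

```python
def reconcile(versions):
    """Last write wins: return the (timestamp, value) pair with the highest timestamp."""
    return max(versions, key=lambda version: version[0])

# Two hypothetical data files holding the same cell (p=1, c=1, column d).
sstable_old = {(1, 1, "d"): (100, "text-1")}
sstable_new = {(1, 1, "d"): (200, "text-1-updated")}

def read(key, sstables):
    """Collect every stored version of the cell and return the newest value."""
    versions = [table[key] for table in sstables if key in table]
    return reconcile(versions)[1] if versions else None

def compact(sstables):
    """Merge files, keeping only the newest version of each cell (no tombstone needed)."""
    merged = {}
    for table in sstables:
        for key, version in table.items():
            if key not in merged or version[0] > merged[key][0]:
                merged[key] = version
    return merged
```

Reading `(1, 1, "d")` across both files returns the updated value, and compacting them leaves a single file containing only the newer version; the overwritten value simply disappears without ever being marked by a tombstone.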
Upvotes: 0