Reputation: 5361
I have a Cassandra Customers table which is going to keep a list of customers. Every customer has an address which is a list of standard fields:
{
CustomerName: "",
etc...,
Address: {
street: "",
city: "",
province: "",
etc...
}
}
My question: if I have a million customers in this table and I use a user-defined type (UDT) Address to hold the address information for each customer, what are the implications of such a model, especially in terms of disk space? Is this going to be very expensive? Should I use the Address UDT, flatten the address fields into the Customers table, or even use a separate table?
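For concreteness, the three options could be sketched roughly as follows. This is only an illustration: the table names, the customer_id key, the customer_address column, and the text field types are assumptions that do not appear in the model above, and the statements assume a keyspace has already been selected with USE.

```cql
-- Option 1: address stored as a user-defined type (all names hypothetical)
CREATE TYPE IF NOT EXISTS address (
    street text,
    city text,
    province text
);

CREATE TABLE IF NOT EXISTS customers_udt (
    customer_id uuid PRIMARY KEY,
    customer_name text,
    customer_address frozen<address>
);

-- Option 2: address fields flattened into the customer row
CREATE TABLE IF NOT EXISTS customers_flat (
    customer_id uuid PRIMARY KEY,
    customer_name text,
    address_street text,
    address_city text,
    address_province text
);

-- Option 3: address kept in a separate table, looked up by the same key
CREATE TABLE IF NOT EXISTS customer_addresses (
    customer_id uuid PRIMARY KEY,
    street text,
    city text,
    province text
);
```

Options 1 and 2 return the address with a single read of the customer row, while option 3 needs a second query per customer.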
Upvotes: 7
Views: 1556
Reputation: 13233
With Cassandra 5, we compared a table schema with and without a UDT in one of our test scenarios. The UDT version:
So I do think the difference can be significant. However, I also suggest that you benchmark this yourself, because your mileage may vary wildly.
For completeness' sake, here is the schema we compared:
CREATE KEYSPACE smmv
    WITH replication = {'class': 'NetworkTopologyStrategy', 'xxx': '3'}
    AND durable_writes = true;

-- Flat variant: one column per sensor value
CREATE TABLE smmv.sm1 (
    meter_id uuid,
    year tinyint,
    time timestamp,
    sensor_value1 bigint,
    -- ... same for 2-11
    sensor_value12 bigint,
    PRIMARY KEY ((meter_id, year), time)
) WITH CLUSTERING ORDER BY (time ASC);

-- UDT variant: the twelve values packed into one frozen UDT column
CREATE TYPE smmv.mv_udt (
    v1 bigint,
    -- ... same for 2-11
    v12 bigint
);

CREATE TABLE smmv.udt1 (
    meter_id uuid,
    year tinyint,
    time timestamp,
    measurement frozen<mv_udt>,
    PRIMARY KEY ((meter_id, year), time)
);
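If you want to repeat such a comparison on your own data, one possible starting point is Cassandra's system.size_estimates table. The query below is a sketch that assumes both tables were loaded with the same rows and flushed to disk. The values are rough, node-local estimates per token range that are refreshed periodically, so they are not exact on-disk sizes. For the actual disk footprint, nodetool tablestats smmv.sm1 smmv.udt1 reports the space used per table.

```cql
-- Rough, node-local size estimates for every table in the smmv keyspace.
-- One row is returned per table and token range owned by this node.
SELECT table_name, range_start, range_end, partitions_count, mean_partition_size
FROM system.size_estimates
WHERE keyspace_name = 'smmv';
```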
Upvotes: 0
Reputation: 5249
Basically, Cassandra will serialize each address instance into a blob, which is stored as a single column in your customer table. I don't have numbers at hand on how much overhead the serialization adds in disk or CPU usage, but it probably won't make a big difference for your use case. You should test both cases to be sure.
Edit: Another aspect I should have mentioned: because a UDT is handled as a single blob, any update has to replace the complete UDT value. This is less efficient than updating individual columns and is a potential cause of inconsistencies: with concurrent updates, each write can overwrite the other's changes. See CASSANDRA-7423, which added field-level updates for non-frozen UDTs in Cassandra 3.6.
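The difference in update behavior can be sketched as follows. The type, table, and column names and the UUID are hypothetical, the statements assume a keyspace has already been selected with USE, and the second UPDATE assumes Cassandra 3.6 or later, where non-frozen UDTs are available.

```cql
CREATE TYPE IF NOT EXISTS address (
    street text,
    city text,
    province text
);

-- Frozen UDT: the address is stored and replaced as one value.
CREATE TABLE IF NOT EXISTS customers_frozen (
    customer_id uuid PRIMARY KEY,
    customer_address frozen<address>
);

-- Changing only the city still means rewriting every field.
UPDATE customers_frozen
SET customer_address = {street: '1 Main St', city: 'Ottawa', province: 'ON'}
WHERE customer_id = 123e4567-e89b-12d3-a456-426614174000;

-- Non-frozen UDT: individual fields can be written on their own.
CREATE TABLE IF NOT EXISTS customers_nonfrozen (
    customer_id uuid PRIMARY KEY,
    customer_address address
);

UPDATE customers_nonfrozen
SET customer_address.city = 'Ottawa'
WHERE customer_id = 123e4567-e89b-12d3-a456-426614174000;
```

With the frozen column, two clients that each change a different field must both write the full address, so the later write silently discards the earlier one. With the non-frozen column, each field is written separately and the two changes can coexist.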
Upvotes: 4