Reputation: 5361

Are Cassandra user defined data types recommended in view of performance?

I have a Cassandra Customers table which is going to keep a list of customers. Every customer has an address which is a list of standard fields:

{
   CustomerName: "",
   etc...,
   Address: {
              street: "",
              city: "",
              province: "",
              etc...
            }
}

My question: if I have a million customers in this table and use a user defined data type Address to store the address information for each customer, what are the implications of such a model, especially in terms of disk space? Is this going to be very expensive? Should I use the Address user defined data type, flatten the address information into separate columns, or even use a separate table?

Upvotes: 7

Answers (2)

BatteryBackupUnit

Reputation: 13233

In one of our test scenarios on Cassandra 5, we compared a table schema with and without a UDT. The UDT version:

uses ~12% more disk space
achieves 31% more write/read throughput
at 18% less CPU load on the Cassandra node

So I do think the difference can be significant. However, I also suggest that you benchmark for yourself, because your mileage may vary wildly.

For completeness' sake here's the schema spec of what we've compared:

Keyspace

CREATE KEYSPACE smmv
  WITH replication = {'class': 'NetworkTopologyStrategy', 'xxx': '3'}
  AND durable_writes = true;

Table

Without UDT

CREATE TABLE smmv.sm1 (
    meter_id uuid,
    year tinyint,
    time timestamp,
    sensor_value1 bigint,
    -- ... same for 2-11
    sensor_value12 bigint,
    PRIMARY KEY ((meter_id, year), time)
) WITH CLUSTERING ORDER BY (time ASC);

With UDT

create type smmv.mv_udt (
    v1 bigint,
    -- ... same for 2-11
    v12 bigint
);

create table smmv.udt1 (
    meter_id UUID,
    year tinyint,
    time timestamp,
    measurement FROZEN<mv_udt>,
    PRIMARY KEY ((meter_id, year), time)
);
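
To make the difference between the two schemas more concrete, here is a rough sketch of how a single row might be written to each table. The UUID, year, timestamp and sensor readings are made-up values for illustration, and only two of the twelve value columns are set to keep it short (unset columns and UDT fields are simply left null).

```cql
-- Without UDT: every sensor value is its own column (its own cell on disk).
INSERT INTO smmv.sm1 (meter_id, year, time, sensor_value1, sensor_value12)
VALUES (123e4567-e89b-12d3-a456-426614174000, 24,
        '2024-01-01 00:00:00+0000', 100, 200);

-- With UDT: all values travel together as one frozen value in a single column.
INSERT INTO smmv.udt1 (meter_id, year, time, measurement)
VALUES (123e4567-e89b-12d3-a456-426614174000, 24,
        '2024-01-01 00:00:00+0000', {v1: 100, v12: 200});
```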

Upvotes: 0

Stefan Podkowinski

Reputation: 5249

Basically, what happens in this case is that Cassandra serializes each instance of address into a blob, which is stored as a single column in your customer table. I don't have any numbers at hand on how much overhead the serialization adds in disk or CPU usage, but it probably will not make a big difference for your use case. You should test both cases to be sure.
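
As a minimal sketch of the two variants you would be comparing: the keyspace name `shop`, the key column `customer_id` and the `customers_flat` table are assumptions of mine, not taken from the question, and the keyspace is assumed to exist already.

```cql
-- Variant 1: address as a frozen UDT, stored as one serialized column value.
CREATE TYPE IF NOT EXISTS shop.address (
    street text,
    city text,
    province text
);

CREATE TABLE IF NOT EXISTS shop.customers (
    customer_id uuid PRIMARY KEY,   -- hypothetical key column
    customer_name text,
    address frozen<address>
);

-- Variant 2: flattened, one regular column per address field.
CREATE TABLE IF NOT EXISTS shop.customers_flat (
    customer_id uuid PRIMARY KEY,   -- hypothetical key column
    customer_name text,
    address_street text,
    address_city text,
    address_province text
);
```

Loading the same sample data into both tables and comparing on-disk size and throughput should tell you whether the difference matters for your workload.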

Edit: Another aspect I should have mentioned: because a UDT is handled as a single blob, any update has to replace the complete UDT. This is less efficient than updating individual columns and is a potential cause of inconsistencies: with concurrent updates, two writes can overwrite each other's changes. See CASSANDRA-7423.
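
A rough illustration of the update behaviour, with assumed names (`shop`, `customers_frozen`, `customers_nonfrozen`, `customer_id`) and made-up values. As far as I know, the field-level update in the second half relies on non-frozen UDT support, which I believe arrived with Cassandra 3.6 as the outcome of CASSANDRA-7423, so check it against your version.

```cql
CREATE TYPE IF NOT EXISTS shop.address (
    street text,
    city text,
    province text
);

-- Frozen UDT: the whole address has to be rewritten, even to change one field.
CREATE TABLE IF NOT EXISTS shop.customers_frozen (
    customer_id uuid PRIMARY KEY,
    addr frozen<address>
);

UPDATE shop.customers_frozen
SET addr = {street: '1 Main St', city: 'Ottawa', province: 'ON'}
WHERE customer_id = 123e4567-e89b-12d3-a456-426614174000;

-- Non-frozen UDT (assumes a Cassandra version that supports it):
-- a single field can be set without touching the others.
CREATE TABLE IF NOT EXISTS shop.customers_nonfrozen (
    customer_id uuid PRIMARY KEY,
    addr address
);

UPDATE shop.customers_nonfrozen
SET addr.city = 'Ottawa'
WHERE customer_id = 123e4567-e89b-12d3-a456-426614174000;
```

With the frozen variant, two clients that each read the address, change a different field and write it back will race, and the later write wins for the whole address.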

Upvotes: 4