Reputation: 5684
I have a 40 column RDBMS table which I am porting to Cassandra.
Using the estimator at http://docs.datastax.com/en/cassandra/2.1/cassandra/planning/architecturePlanningUserData_t.html
I created a excel sheet with column names, data types, size of each column etc. The Cassandra specific overhead for each RDBMS row is a whopping 1KB when the actual data is only 192 bytes.
Since the overheads are proportional to number of columns, I thought it would be much better if I just create a UDT for the fields that are not part of the primary key. That way, I would incur the column overhead only once.
Also, I don't intend to run queries on inner fields of the UDT. Even if I did want that, Cassandra has very limited querying features that work on non PK fields.
Is this a good strategy to adopt? Are there any pitfalls? Are all these overheads easily eliminated by compression or some other internal operation?
Upvotes: 2
Views: 244
Reputation: 57748
On the surface, this isn't a bad idea at all. You are essentially abstracting your data by another level, but in a way that it is still manageable to meet your needs. It's actually good thinking.
I have a 40 column RDBMS table
This part slightly worries me. Essentially, you'd be creating a UDT with 40 properties. Not a huge deal in and of itself. Cassandra should handle that just fine.
But while you may not be querying on the inner fields of the UDT, you need to ask yourself how often you plan to update them. Cassandra stores UDTs as "frozen" types in a single column. This is important to understand for two reasons:
So you should keep that in mind while designing your application. As long as you won't be writing frequent updates to individual properties of the UDT, this should be a good solution for you.
Upvotes: 2