Reputation: 5396
I'm new in Cassandra, recently I watched very good tutorials on DataStax that was about data modeling.
As I understood, in Cassandra we always have to have different tables for queries we want to have, for example even a simple query for sorting data by time or id.
It means we have to have some tables for each entity that has data according to query we want to have later. Imagine we have Videos, and we may have some tables for videos entity.
First question, as I said for each query we must have a table, we are going to have different kind of sorts on video by different columns:
our columns for video table are:
video_id | video_title | video_create_year | director | timestap
Now should we make other tables for other sorting we may need?
We may need to sort the table by director name (ASC | DESC), video_create_year (ASC | DESC), video_title (ASC | DESC)
I am not sure, Should we make different tables for each different sorting?
Such as:
videos_by_diractor_asc
videos_by_diractor_desc
videos_by_title_asc
videos_by_title_desc
So on...
Did I understand it correctly?
Second question, if I understood it correctly, then I forgot to make a table that I may needed in our website (Imagine one day I get I forgot to have video_by_title_asc) then what should I do? Should I write a program and copy whole data from video table? Or there are some ways in Cassandra to copy whole data if it is necessary?
I hope the question was not confusing.
Upvotes: 1
Views: 724
Reputation: 2321
Okay, you're understanding Cassandra partially right.
I hope i understand you right. Your Primary Key of this tables would look like this:
videos_by_diractor_asc PRIMARY KEY(director)
videos_by_title_asc PRIMARY KEY(title)
But in this case you forgot one thing: The partition key. The partition key is the first part of the primary key. I think, in your case, the year make sense. All rows with the same partition key are always on the same node. Cassandra split your rows by the partition key. The columns after the partition keys, called column keys, are sorted. The partition keys itself are not sorted. This means: node1 can have year 2015, 1998 and 1950 and node2 2010, 1990, 1577. Cassandra evenly distribute the data between the nodes. On modelling, you have to think about one important thing: What are the expected size of my table inside one partition key. This mean, in the video case, how many rows do you expect in one year? 2 Mio? 1 bln? If you will get more than 2bln rows x column, you will have a huge problem. 2bln is the maximum size of each partition key. But remember: It's the maximum. I recommend not more than 500mio. I calculate, in the worst case, with 500mio.
So now we can talk about the column keys. Yes, every sorting needs a new table. And you also need a new table if you want to access data in your WHERE conditions in different orders. One Example: You have this primary key PRIMARY KEY(year, director, title)
The first is the partition key. This means: You always need the year in your where condition. Then your data, with the same partition key, sorted, default in ASC, by director. After the director by title. In this case you can't use this WHERE condition: WHERE year = 2016 and title = 'whatever'
Okay, now i will answer your main question :) The thing about the duplicated data. In Cassandra 3.0 you can use materialized views. Yes, it's a nice feature but it has his overhead. The best solution is to write a wrapper around cassandra. This wrapper only does one thing: It handles all this duplicated data. It knows what's the best way to access data if you need it sorted by title and then by director and not sorted by director and then by title. And one thing: Have no concerns to write data 5 or more times. Cassandra is optimized for writing. It's okay to write data. But don't forgot one thing: Cassandra is a database for known queries. If you know that you will need the data really often in this sorting order, create a table for it. But if you don't know it and you create this table only for the case when: Don't create a table. For this, sometimes queries, you can use spark or another solution.
And one more thing: If you need only to query data by one thing, like only by title, only by director, don't use cassandra for it. This is a main feature of a key value storage.
Upvotes: 2