Reputation: 2697
I've designed the database for the web app I'm building "by the book". That is, I've:
Everything is going well (so far). I've denormalized before with great results, and am currently implementing a part of the app that will use data that hasn't been denormalized yet. Doing so for this particular part will, I predict, increase performance substantially (reading from 1 column family ("table" in the relational world) instead of 7).
However, I fear that I may be denormalizing too much. If I were to do so for the part in question, it would reduce the column-family/table count in my app by about 20%, and having that much of my database denormalized makes me nervous for some reason.
Should the app end up being enough of a success that I'm able to get a database designer or administrator on board, I'd like them to be able to determine that the denormalization I'm performing is necessary for the performance I'm seeking (best case) or at the very least not harmful (worst case).
Are there specific things I should look out for when making denormalization decisions that may indicate whether doing so would be bad, or does it always come down to speed vs. maintainability?
Upvotes: 4
Views: 3452
Reputation: 25140
Designing a schema for Cassandra is very different from designing a schema for a SQL database. With a SQL database, your data fits on one machine, the database will maintain indexes for you, you can perform joins, and you can run complex queries with SQL. All of this makes normalizing data practical.
In Cassandra, your data does not fit on one machine, so you can't perform joins; the only query you can run efficiently is fetching a range of columns for a key, and Cassandra will only maintain limited indexes for you. This makes normalizing your data impractical.
In Cassandra, you typically design your schema to serve the queries you are going to make, and you denormalize to do that. My favorite example of this is what Twitter does for its Rainbird stats service, as explained in this post:
For example, say someone clicks on a t.co link to blog.example.com/foo at 11:41am on 1st Feb.
Rainbird would increment counters for:
t.co click: com (all time)
t.co click: com.example (all time)
t.co click: com.example.blog (all time)
t.co click: com.example.blog /foo (all time)
t.co click: com (1st Feb 2011)
t.co click: com.example (1st Feb 2011)
t.co click: com.example.blog (1st Feb 2011)
t.co click: com.example.blog /foo (1st Feb 2011)
t.co click: com (11am-12 on 1st Feb)
t.co click: com.example (11am-12 on 1st Feb)
t.co click: com.example.blog (11am-12 on 1st Feb)
t.co click: com.example.blog /foo (11am-12 on 1st Feb)
t.co click: com (11:41-42 on 1st Feb)
t.co click: com.example (11:41-42 on 1st Feb)
t.co click: com.example.blog (11:41-42 on 1st Feb)
t.co click: com.example.blog /foo (11:41-42 on 1st Feb)
This one click is written 16 times to satisfy the 16 queries that can be made.
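Here is a minimal Python sketch of that fan-out pattern (the function names and time-bucket formats are hypothetical illustrations, not Twitter's actual code): one incoming click is expanded into every (URL-hierarchy level x time granularity) counter key, so that each pre-aggregated query later is a single key lookup.

    from datetime import datetime

    def counter_keys(url_levels, clicked_at):
        # One bucket per time granularity from the example above.
        buckets = [
            "all time",
            clicked_at.strftime("%Y-%m-%d"),        # daily
            clicked_at.strftime("%Y-%m-%d %H:00"),  # hourly
            clicked_at.strftime("%Y-%m-%d %H:%M"),  # per-minute
        ]
        return [(level, bucket) for bucket in buckets for level in url_levels]

    # A plain dict standing in for Cassandra's distributed counter columns.
    counters = {}

    def record_click(url_levels, clicked_at):
        for key in counter_keys(url_levels, clicked_at):
            counters[key] = counters.get(key, 0) + 1  # one write per counter

    levels = ["com", "com.example", "com.example.blog", "com.example.blog /foo"]
    record_click(levels, datetime(2011, 2, 1, 11, 41))
    print(len(counters))  # 16, matching the 16 counters listed above

The write amplification (16 writes per click) is the price paid so that every read is one cheap key lookup instead of a scan or an aggregation.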
This is a good presentation on how to do indexing in Cassandra.
Upvotes: 10
Reputation: 57748
Denormalizing for the sake of performance is not a bad thing. What you need to consider are the goals of your application/database, and how normalization can help you to achieve them.
First of all, putting a table into 1NF involves eliminating redundant data, or "repeating groups" (Coronel & Rob, 2009). Eliminating data duplicated in multiple locations (be it other tables or rows) is a good thing, and helps with maintenance, data integrity, and performance.
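For illustration, a minimal sqlite3 sketch of that step (all table and column names are hypothetical): the repeating group phone1/phone2/phone3 is replaced by rows in a child table.

    import sqlite3

    conn = sqlite3.connect(":memory:")

    # Not in 1NF: phone1/phone2/phone3 is a "repeating group" --
    # the same kind of fact duplicated across columns.
    conn.execute("""CREATE TABLE customer_flat (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT,
        phone1      TEXT,
        phone2      TEXT,
        phone3      TEXT)""")

    # In 1NF: each phone number is its own row in a child table.
    conn.execute("""CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT)""")
    conn.execute("""CREATE TABLE customer_phone (
        customer_id INTEGER REFERENCES customer(customer_id),
        phone       TEXT,
        PRIMARY KEY (customer_id, phone))""")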
Getting to 2NF involves eliminating partial dependencies. Partial dependencies exist when you have a composite key (a primary key composed of multiple key fields) and fields whose values are determined by only part of that key. Typically, the elimination of partial dependencies is where you start to see bridge tables created to handle many-to-many relationships.
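A sketch of the 2NF step in the same style (again, hypothetical names): with the composite key (order_id, product_id), product_name depends on product_id alone, so it moves out, and the remaining bridge table handles the many-to-many between orders and products.

    import sqlite3

    conn = sqlite3.connect(":memory:")

    # Partial dependency: the key is (order_id, product_id), but
    # product_name is determined by product_id alone.
    conn.execute("""CREATE TABLE order_line_flat (
        order_id     INTEGER,
        product_id   INTEGER,
        product_name TEXT,  -- depends on only part of the key
        quantity     INTEGER,
        PRIMARY KEY (order_id, product_id))""")

    # In 2NF: product_name lives with its full determinant, and
    # order_item is the bridge table for the many-to-many relationship.
    conn.execute("""CREATE TABLE product (
        product_id   INTEGER PRIMARY KEY,
        product_name TEXT)""")
    conn.execute("""CREATE TABLE order_item (
        order_id   INTEGER,
        product_id INTEGER REFERENCES product(product_id),
        quantity   INTEGER,
        PRIMARY KEY (order_id, product_id))""")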
3NF is a step further, in that it eliminates all transitive dependencies: fields that are dependent on the value of non-key fields. This step is the one that is often negotiable in the name of performance. Depending on the size or variability of the transitive fields' values, you'll want to weigh the cost of keeping those values in the table against how often you'll have to JOIN to get them.
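And a sketch of the 3NF step, again with hypothetical names: dept_name is determined by dept_id, a non-key field, so it moves to its own table; the trade-off is that reading it now costs a JOIN.

    import sqlite3

    conn = sqlite3.connect(":memory:")

    # Transitive dependency: employee_id -> dept_id -> dept_name,
    # i.e. dept_name is determined by the non-key field dept_id.
    conn.execute("""CREATE TABLE employee_flat (
        employee_id INTEGER PRIMARY KEY,
        name        TEXT,
        dept_id     INTEGER,
        dept_name   TEXT)""")

    # In 3NF: dept_name moves to its own table; reading it with an
    # employee now requires the JOIN weighed in the paragraph above.
    conn.execute("""CREATE TABLE department (
        dept_id   INTEGER PRIMARY KEY,
        dept_name TEXT)""")
    conn.execute("""CREATE TABLE employee (
        employee_id INTEGER PRIMARY KEY,
        name        TEXT,
        dept_id     INTEGER REFERENCES department(dept_id))""")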
The bottom line: eliminating redundant data and dependent data (partial and transitive) is a good thing. But don't let that stop you from doing what makes sense for your application.
C. Coronel & P. Rob (2009), Database Systems: Design, Implementation, and Management, Course Technology, Boston, MA (Ch. 5)
Upvotes: 1
Reputation: 12804
Generally you want as much normalization as you can tolerate, especially for tables you expect to grow large. I have skipped normalization for very small or directly related data sets, but never for performance reasons (that's what reporting servers and ETL are for); I find the extra design effort, and the joins needed to reassemble very small, directly related, rarely changing tables, to be a waste of development time.
My biggest concerns with denormalization are data integrity and wasted space (on disk and in memory), in that order.
My only concern with normalization is maintainability; making something very simple far more complex than it needs to be is generally fruitless. Normalization for normalization's sake is fanatical as far as I'm concerned, and only the Sith deal in absolutes.
Upvotes: 1