On the site I am currently working on, members can favorite other members. When a member goes to their favorites page, they can see all the members they have favorited over time.
I can go about this in 2 ways:
Method #1:
Every time a user favorites another, I insert a row into the favorites table, which looks like this (the index is on user_favoriting_id):
id | user_favorited_id | user_favoriting_id
-------------------------------------------
Then when they load the "My Favorites" page, I do a SELECT on the favorites table to find all rows where user_favoriting_id equals the ID of the currently logged-in user. I then take the user_favorited_id values, build a single SELECT out of them, and look up the respective users in a separate users table.
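In rough SQL, something like this (just a sketch; 42 is an example user ID, and the two lookups could also be collapsed into a single JOIN, as several answers below suggest):

CREATE TABLE favorites (
    id                 INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    user_favorited_id  INT UNSIGNED NOT NULL,
    user_favoriting_id INT UNSIGNED NOT NULL,
    KEY idx_favoriting (user_favoriting_id)  -- the index mentioned above
);

SELECT u.*
FROM favorites f
JOIN users u ON u.id = f.user_favorited_id
WHERE f.user_favoriting_id = 42;  -- 42 = ID of the logged-in user (example)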
Method #2:
Each time a user favorites another, I update the favorites field on their row in the users table, which looks something like this (albeit with more fields; the index is on id):
id | username | password | email | account_status | timestamp | favorites
--------------------------------------------------------------------------
I will CONCAT the ID of the user being favorited onto the favorites field, so that column will hold a comma-separated string like so:
10,44,67
etc...
Then, to produce the My Favorites page, I just grab all the favorited users with one SELECT, like in Method #1. That part is the same.
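The write for Method #2 would be something like this (a sketch; 42 and 67 are example IDs, and CONCAT_WS with NULLIF just avoids a leading comma on the first favorite):

UPDATE users
SET favorites = CONCAT_WS(',', NULLIF(favorites, ''), 67)  -- 67 = user being favorited
WHERE id = 42;                                             -- 42 = user doing the favoriting

SELECT favorites FROM users WHERE id = 42;  -- then explode(',', ...) in PHP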
I know method #1 is the normalized way to do it and is much prettier. But my concern for this particular project is scalability and performance above anything else.
If I choose Method #2, it avoids the extra lookup on the separate favorites table, since the users table has to be queried anyway as soon as the user logs in.
And I'm pretty sure that using PHP's explode function to split up the CSV values in Method #2 wouldn't take nearly as much time as Method #1's additional DB lookup on the favorites table, but just in case I must ask:
From a purely performance perspective, which of these methods is more optimized?
Also please assume that this website will get a trillion page views a day.
Upvotes: 1
Views: 180
Reputation: 63538
Would Method #2 actually be faster? Probably not, and it would totally screw up your database for the reasons others have already cited.
Do NOT use a comma-separated-list-of-IDs pattern. It just sucks.
I strongly suspect that you won't have enough users on your site for this to matter; unless you're Facebook, you are unlikely to have more than 1M users, and most of them won't choose anybody as a favourite (because most will be casual users who never touch that feature).
So you're looking at an extremely small table with only two meaningful columns: say, 1M rows at most, if your 1M users average one favourite each (and most won't use the feature at all). You can potentially improve scans in InnoDB by making the primary key start with the column you most commonly search by, BUT - get this - you can still add a secondary index on the other column and get reasonable lookup times (actually, VERY quick, as the table will fit into memory on even the tiniest server!).
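For example, something like this (a sketch using the column names from the question):

CREATE TABLE favorites (
    user_favoriting_id INT UNSIGNED NOT NULL,
    user_favorited_id  INT UNSIGNED NOT NULL,
    -- InnoDB clusters rows by primary key, so leading with the most
    -- commonly searched column keeps each user's favorites contiguous
    PRIMARY KEY (user_favoriting_id, user_favorited_id),
    -- secondary index for lookups in the other direction
    KEY idx_favorited (user_favorited_id)
) ENGINE=InnoDB;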
Upvotes: 0
Reputation: 9560
As far as I know, denormalization in MySQL is fairly trivial. But if you used not an RDBMS but a database like CouchDB or MongoDB, there is a whole engine for manipulating data in a safe way, and it's really scalable; a non-relational database would work faster for you here.
The only method I prefer for optimizing a web app that uses MySQL is to denormalize the table and then hand some of the work to PHP; and of course, using HipHop you will get some really big optimization there, because you offload MySQL and load PHP, which HipHop will optimize by up to 50%!
Upvotes: 0
Reputation: 183321
You say that scalability is a concern. This seems to imply that Method #2 won't work for you, because it limits the number of favorites that a user can have. (For example, if you have a million users, then most users will have six-digit IDs, so each entry in favorites costs about seven characters including the comma. How wide do you want to let favorites be? If it's a VARCHAR(1000), that means fewer than 150 favorites are allowed.)
Also, do you really expect that you will never want to know which users have "favorited" a given user? Your Method #2 might be O.K. if you know that you will always look up favoritings by "favoriter" rather than "favoritee", but it falls apart completely otherwise. (And even then, it only makes sense if you never need anything meaningful about the "favoritee" aside from his/her user ID; if you actually look up the "favoritees", then you're doing all the hard work of a JOIN by hand, while removing any opportunity for MySQL to do the JOIN intelligently.)
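To make that concrete (a sketch with example IDs; FIND_IN_SET is how you'd search a CSV column in MySQL, and it cannot use an index):

-- Method #1: who favorited user 67? An ordinary indexed lookup.
SELECT user_favoriting_id FROM favorites WHERE user_favorited_id = 67;

-- Method #2: the same question forces a full scan of the users table.
SELECT id FROM users WHERE FIND_IN_SET('67', favorites);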
Overall, it's better to start out with best practices, such as normalization, and to move away from them only when performance requires it. Otherwise something that seems like a performance optimization can have negative consequences, forcing you to write very suboptimal code further down the line.
Upvotes: 1
Reputation: 146499
Use both. The first (the normalized approach) is preferable from a data-normalization, maintainability, and data-integrity perspective (and for other reasons as well); you should always strongly favor it.
But there's no reason not to use the other approach as well if the normalized one can't deliver acceptable read performance; often a denormalized alternative will read faster. So use the normalized structure as the "master" for keeping track of the data and ensuring its integrity, and keep a denormalized "copy" in the other structure for read access. Update the copy from the master any time the master changes (inserts, updates, deletes).
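For example, you could keep the copy in sync with a trigger on the master table (just a sketch, assuming the favorites table and users.favorites column from the question; a matching delete trigger that rebuilds the list is omitted):

CREATE TRIGGER favorites_after_insert
AFTER INSERT ON favorites
FOR EACH ROW
-- append the new favorite to the denormalized CSV copy
UPDATE users
SET favorites = CONCAT_WS(',', NULLIF(favorites, ''), NEW.user_favorited_id)
WHERE id = NEW.user_favoriting_id;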
But measure the performance of your alternative approach to ensure it is indeed faster, and by a margin sufficient to justify its use.
Upvotes: 0
Reputation: 308763
JOINs take time, but I wouldn't make the change until you have some data that suggests that it's necessary.
Normalization is good for a number of reasons; it's not just an academic exercise.
Concatenation of IDs into a column is a heinous crime against normalization. Don't do it.
You're assuming that your code is faster than all the work that's been done to optimize relational databases. That's a big mistake.
Make sure that you have indexes on the primary and foreign keys that participate in the JOINs (see the EXPLAIN sketch below).
Profile your app when you have real performance issues; don't guess.
Make sure that the real problem isn't your app. Bringing back too much unnecessary information will be more of a performance drag than a normalized schema.
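For example (a sketch; 42 is an example user ID):

EXPLAIN
SELECT u.username
FROM favorites f
JOIN users u ON u.id = f.user_favorited_id
WHERE f.user_favoriting_id = 42;

-- The row for favorites should show an index on user_favoriting_id
-- being used; a full table scan here means an index is missing.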
Upvotes: 1