Reputation: 37481
Working with MongoDB, I'm fairly aware of when normalization makes sense, but I'm trying to understand whether it makes sense from a data-size or query-performance standpoint.
As an example, let's say I have a list of 2,000 retail stores. For this example I don't care about anything else, so I only have the store name: at least a few characters, but never longer than a hundred.
I have 20 million records of people who took a trip to a store. I'd record some info about the trip, along with this store name.
In MySQL, the file size of the table/records can impact the query performance quite heavily. Normalizing the store names so that I can store the string once, and then write the key (an int) instead saves on a lot of disk space. That turns into performance gains.
In MongoDB, this type of data is small and unimportant enough that it should simply be embedded. However, I'm concerned that after 20 million records, queries and/or file size will suffer performance problems because of how much redundant data is there.
If 5 million people all went to a single store, there's going to be 5 million embedded copies of that string in mongo documents. That will take up a lot more disk space and seems like it may eat up performance over an integer.
I suppose if I did normalize the data, Mongo would still have some performance overhead to query that additional data.
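For concreteness, the two shapes being compared can be sketched in Python (the field names and sample data are made up, and JSON length is only a rough proxy for BSON size):

```python
import json

# Denormalized: the store name is embedded in every trip document.
trip_embedded = {
    "person": "alice",          # hypothetical trip fields
    "visited_at": "2015-06-01",
    "store_name": "Midtown Hardware & Garden Supply Emporium",
}

# Normalized: trips hold only a short store reference; the name
# lives once in a separate 'stores' collection.
store = {"_id": 42, "name": "Midtown Hardware & Garden Supply Emporium"}
trip_referenced = {
    "person": "alice",
    "visited_at": "2015-06-01",
    "store_id": 42,
}

# Rough per-document size difference:
saved = len(json.dumps(trip_embedded)) - len(json.dumps(trip_referenced))
print(saved)  # bytes saved per trip; multiplied across 20M trips it adds up
```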
Upvotes: 1
Views: 469
Reputation: 46331
If 5 million people all went to a single store, there's going to be 5 million embedded copies of that string in mongo documents.
Yes, and much worse: should the name ever change, you'd have to update 5 million documents.
That will take up a lot more disk space and seems like it may eat up performance over an integer.
I'd abstain from integers as keys. Sequential integers create a bottleneck (the sequencer itself becomes one), and algorithms like HiLo produce keys that aren't monotonic. Use ObjectId or something else that is non-sequential but monotonic.
Monotonicity often improves data locality (if creation time is a locality criterion, which it often is) and helps utilize the B-tree more efficiently.
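A toy illustration of the monotonicity point: ObjectId starts with a big-endian timestamp, so ids created later sort later. This sketch imitates only that property (it is not the real ObjectId layout, which also includes random and counter bytes):

```python
import os
import time

def make_monotonic_id(now=None):
    """ObjectId-style sketch: a 4-byte big-endian timestamp prefix
    followed by random bytes. Newly created ids sort after older ones,
    so inserts cluster at the 'right edge' of the B-tree."""
    ts = int(now if now is not None else time.time())
    return ts.to_bytes(4, "big") + os.urandom(8)

a = make_monotonic_id(now=1_600_000_000)
b = make_monotonic_id(now=1_600_000_060)  # one minute later
print(a < b)  # byte order follows creation order
```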
I have 20 million records of people who took a trip to a store. I'd record some info about the trip, along with this store name.
That's a typical graph problem. People and Stores are nodes, trips are the edges. MongoDB handles normalized graphs better than one might think.
Directly comparing the performance of the two approaches is tricky - the additional roundtrip will make the normalized query slower, but you could choose to additionally store the store's name with each trip. The performance overhead won't be huge, but it will add RAM pressure.
Summing up: adjusting the data structure to require fewer queries certainly has merit, but for a sheer size reduction from, say, 80 bytes on average to 24 bytes I wouldn't do it. The normalized structure is much easier to keep consistent (or eventually consistent, when storing both), easy enough to query, and can be efficiently 'joined' using $in. The upside is clear data ownership and more flexibility; for instance, reversing the query ('who ever took a trip to store x?') becomes much easier.
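The two-query $in 'join' can be sketched in plain Python (no server needed; the dicts stand in for documents, and the second lookup mirrors what db.stores.find({"_id": {"$in": ids}}) would do):

```python
# In-memory stand-ins for the two collections (hypothetical data).
stores = [
    {"_id": 1, "name": "Corner Grocer"},
    {"_id": 2, "name": "Main St Books"},
]
trips = [
    {"person": "alice", "store_id": 1},
    {"person": "bob",   "store_id": 2},
    {"person": "carol", "store_id": 1},
]

# Query 1: fetch the trips of interest.
wanted = [t for t in trips if t["person"] in ("alice", "carol")]

# Query 2: resolve all referenced stores in a single extra round trip.
ids = {t["store_id"] for t in wanted}
names = {s["_id"]: s["name"] for s in stores if s["_id"] in ids}

for t in wanted:
    print(t["person"], "->", names[t["store_id"]])
```

The cost is one additional round trip per page of results, not one per trip, which is why the overhead stays modest.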
Upvotes: 1
Reputation: 294
The question of normalizing depends on how you access the data 95% of the time. For example, if you have a site or app that lists the stores first and then, based on a request, shows the trips to a store, then definitely normalize.
Adding an index to the store attribute in your collection will improve the performance of those searches whether or not it's an int.
Assuming you have done that, from what I have read the document size does not affect the search, as MongoDB uses the index to find the documents with the referenced store.
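As a sketch of why the index, rather than document size, drives lookup cost: a dict here plays the role of the B-tree index, mapping the store key to document positions, so a query touches only the matching documents (collection and field names are made up):

```python
trips = [
    {"store_id": 7, "person": "alice"},
    {"store_id": 3, "person": "bob"},
    {"store_id": 7, "person": "carol"},
]

# Build a secondary index on store_id: key -> list of document positions,
# the moral equivalent of db.trips.createIndex({store_id: 1}).
index = {}
for pos, doc in enumerate(trips):
    index.setdefault(doc["store_id"], []).append(pos)

# A query for store 7 follows the index straight to its documents;
# it never scans the collection, however large each document is.
matches = [trips[pos] for pos in index.get(7, [])]
print([d["person"] for d in matches])
```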
Upvotes: 0