Harel

Reputation: 2039

MongoDB data storage performance: one doc with items in an array vs. multiple docs per item

I have statistical data in a MongoDB collection, saved per record per day. For example, my collection looks roughly like this:

{ record_id: 12345, date: Date(2011,12,13), stat_value_1:12345, stat_value_2:98765 }

Each record_id/date combo is unique. I query the collection to get statistics per record for a given date range using map-reduce.

In terms of read query performance, is this strategy superior to storing one document per record_id that contains an array of statistical data shaped like the dict above:

{ _id: record_id, stats: [
{ date: Date(2011,12,11), stat_value_1:39884, stat_value_2:98765 },
{ date: Date(2011,12,12), stat_value_1:38555, stat_value_2:4665 },
{ date: Date(2011,12,13), stat_value_1:12345, stat_value_2:265 },
]}

On the pro side, I will need only one query to get the entire stat history of a record, without resorting to the slower map-reduce method. On the con side, I'll have to sum up the stats for a given date range in my application code, and if a record outgrows its current padding size-wise, there's some disk reallocation that will go on.
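To make the con concrete, here is a minimal sketch of what summing a date range in application code could look like for the one-document-per-record layout. The helper name `sumEmbeddedStats` is made up for illustration, and the dates are written as UTC `Date` objects (JavaScript months are zero-based, so 11 is December):

```javascript
// Hypothetical helper: sum the embedded stats of one record over [from, to).
function sumEmbeddedStats(record, from, to) {
  const totals = { stat_value_1: 0, stat_value_2: 0 };
  for (const s of record.stats) {
    // Date objects compare by their millisecond timestamps.
    if (s.date >= from && s.date < to) {
      totals.stat_value_1 += s.stat_value_1;
      totals.stat_value_2 += s.stat_value_2;
    }
  }
  return totals;
}

// Sample document in the one-document-per-record layout (values from above).
const record = {
  _id: 12345,
  stats: [
    { date: new Date(Date.UTC(2011, 11, 11)), stat_value_1: 39884, stat_value_2: 98765 },
    { date: new Date(Date.UTC(2011, 11, 12)), stat_value_1: 38555, stat_value_2: 4665 },
    { date: new Date(Date.UTC(2011, 11, 13)), stat_value_1: 12345, stat_value_2: 265 },
  ],
};

// Sum December 12 and 13 only (the upper bound is exclusive).
const totals = sumEmbeddedStats(
  record,
  new Date(Date.UTC(2011, 11, 12)),
  new Date(Date.UTC(2011, 11, 14))
);
console.log(totals); // { stat_value_1: 50900, stat_value_2: 4930 }
```

Note that the whole stats array has to be fetched and scanned even when only a short date range is needed.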

Upvotes: 2

Views: 1145

Answers (2)

jianpx

Reputation: 3330

I think you can refer to this here, and also see how foursquare solved this kind of problem here. Both are valuable.

Upvotes: 0

mnemosyn

Reputation: 46291

I think this depends on the usage scenario. If the data set for a single aggregation is small, like those 700 records, and you want to do this in real time, I think it's best to choose yet another option: query all the individual records and aggregate them client-side. This avoids the Map/Reduce overhead, is easier to maintain, and does not suffer from reallocation or document size limits. Index use should be efficient, and connection-wise I doubt there's much of a difference, since most drivers batch transfers anyway.
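As a rough sketch of the client-side option, assuming the one-document-per-day layout from the question: the helper name `aggregateClientSide` is hypothetical, and the query result is stubbed with literals so the snippet runs without a database.

```javascript
// Hypothetical helper: sum every stat_value_* field over per-day documents.
function aggregateClientSide(docs) {
  const totals = {};
  for (const doc of docs) {
    for (const [key, value] of Object.entries(doc)) {
      if (key.startsWith("stat_value_")) {
        totals[key] = (totals[key] || 0) + value;
      }
    }
  }
  return totals;
}

// With the Node.js driver, the input could come from something like
//   collection.find({ record_id: 12345, date: { $gte: from, $lt: to } }).toArray()
// where collection, from, and to are assumed names. A compound index on
// { record_id: 1, date: 1 } would let that query use the index efficiently.
// JavaScript months are zero-based, so 11 is December.
const dailyDocs = [
  { record_id: 12345, date: new Date(Date.UTC(2011, 11, 11)), stat_value_1: 39884, stat_value_2: 98765 },
  { record_id: 12345, date: new Date(Date.UTC(2011, 11, 12)), stat_value_1: 38555, stat_value_2: 4665 },
  { record_id: 12345, date: new Date(Date.UTC(2011, 11, 13)), stat_value_1: 12345, stat_value_2: 265 },
];

const rangeTotals = aggregateClientSide(dailyDocs);
console.log(rangeTotals); // { stat_value_1: 90784, stat_value_2: 103695 }
```

Because the helper sums any field whose name starts with `stat_value_`, adding further stat fields later would not require changing it.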

The added flexibility might come in handy, for instance if you want to know the stat value for a single day across all records (if that ever makes sense for your application). Should you ever need to store more stat_values, your maximum number of dates per record would go down in the subdocument approach, because each document has a size limit. It's also generally easier to work with top-level documents than with subdocuments.
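To illustrate the single-day case, here is a small sketch over per-day documents. The helper name `totalsForDay` is hypothetical, and the second record (`record_id: 67890`) and its values are invented purely for the example. In practice the day filter would more likely be part of the query itself, for example a `find` on the `date` field.

```javascript
// Hypothetical helper: total the stats of all records for one given day.
function totalsForDay(docs, day) {
  const totals = { records: 0, stat_value_1: 0, stat_value_2: 0 };
  for (const doc of docs) {
    // Compare timestamps; === on two Date objects would compare identity.
    if (doc.date.getTime() === day.getTime()) {
      totals.records += 1;
      totals.stat_value_1 += doc.stat_value_1;
      totals.stat_value_2 += doc.stat_value_2;
    }
  }
  return totals;
}

// JavaScript months are zero-based, so 11 is December.
const dec12 = new Date(Date.UTC(2011, 11, 12));
const dec13 = new Date(Date.UTC(2011, 11, 13));

const allDocs = [
  { record_id: 12345, date: dec13, stat_value_1: 12345, stat_value_2: 98765 },
  { record_id: 67890, date: dec13, stat_value_1: 100, stat_value_2: 200 }, // invented record
  { record_id: 12345, date: dec12, stat_value_1: 38555, stat_value_2: 4665 },
];

const dayTotals = totalsForDay(allDocs, dec13);
console.log(dayTotals); // { records: 2, stat_value_1: 12445, stat_value_2: 98965 }
```

With the subdocument approach, the same question would require touching every record document and digging into each stats array.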

Map/Reduce really shines if you're aggregating huge amounts of data across multiple servers, where otherwise bandwidth and client concurrency would be bottlenecks.

Upvotes: 2
