user2421024

Reputation: 51

mongodb - Recommended tree structure for large amount of data points

I'm working on a project which records price history for items across multiple territories, and I'm planning on storing the data in a mongodb collection.

As I'm relatively new to mongodb, I'm curious about what might be a recommended document structure for quite a large amount of data. Here's the situation:

I'm recording the price history for about 90,000 items across 200 or so territories. I'm looking to record the price of each item every hour and keep a 2-week history for any given item. That comes out to around (90000 * 200 * 24 * 14) ~= 6 billion data points, or approximately 67,200 per item. A cleanup query will be run once a day to remove records older than 14 days (more specifically, to archive them to a gzipped JSON/text file).
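For the cleanup, I'm picturing something like this (a rough sketch against the flat one-document-per-point layout I describe below, assuming time is stored as epoch milliseconds):

// Daily cleanup sketch: remove anything older than 14 days.
// The archive step would run first, outside the shell
// (e.g. mongoexport with the same filter, piped through gzip).
var cutoff = Date.now() - 14 * 24 * 3600 * 1000;
db.collection.remove({ time: { $lt: cutoff } })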

In terms of data that I will be getting out of this, I'm mainly interested in two things: 1) The price history for a specific item in a specific territory, and 2) the price history for a specific item across ALL territories.

Before I actually start importing this data and running benchmarks, I'm hoping someone might be able to give some advice on how I should structure this to allow for quick access to the data through a query.

I'm considering the following structure:

{
    _id: 1234,
    data: [
        {
            territory: "A",
            price: 5678,
            time: 123456789
        },
        {
            territory: "B",
            price: 9876,
            time: 123456789
        }
    ]
}
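Under this structure, the hourly write for each territory would just append a point to the array, something like:

// Hourly write sketch: push one territory/price/time point
// onto the item's data array.
db.collection.update(
    { _id: 1234 },
    { $push: { data: { territory: "B", price: 9876, time: 123456789 } } }
)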

Each item is its own document, with each territory/price/time point for that item stored in the data array. The issue I run into with this is retrieving the price history for a particular item in a particular territory. I believe I can accomplish this with the following query:

db.collection.aggregate([
    // Match the item first so only its document is unwound
    { $match: { _id: 1234 } },
    { $unwind: "$data" },
    { $match: { "data.territory": "B" } }
])
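For the second query (all territories), I assume I could simply drop the territory condition:

// Price history for one item across every territory
db.collection.aggregate([
    { $match: { _id: 1234 } },
    { $unwind: "$data" }
])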

The other alternative I was considering was to just put every single data point in its own document and put an index on the item and territory.

// Document 1
{
    item: 1234,
    territory: "A",
    price: 5679,
    time: 123456789
}
// Document 2
{
    item: 1234,
    territory: "B",
    price: 9676,
    time: 123456789
}
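With this layout, the supporting index and the two main reads would look something like this (I've added time to the compound index so the sorted reads can use it; db.collection stands in for whatever the collection ends up being called):

// Compound index covering both main queries
db.collection.createIndex({ item: 1, territory: 1, time: 1 })

// 1) History for one item in one territory
db.collection.find({ item: 1234, territory: "B" }).sort({ time: 1 })

// 2) History for one item across all territories
db.collection.find({ item: 1234 }).sort({ territory: 1, time: 1 })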

I'm just unsure whether having 6 billion documents with 3 indexes, or having 90,000 documents with 67,200 array objects each and querying them with an aggregation, would be better for performance.

Or perhaps there's some other tree structure or handling of this problem that you fine folks and MongoDB wizards can recommend?

Upvotes: 5

Views: 925

Answers (1)

wdberkeley

Reputation: 11671

I would structure the documents as "prices for a product in a given territory per fixed time interval". The time interval is fixed for the schema as a whole, but different choices produce different schemas, and the best one for your application will probably have to be decided by testing. Choosing a time interval of 1 hour gives your second schema idea, with ~6 billion documents in total. You could choose the time interval to be 2 weeks (don't). In my mind, the best time interval is 1 day, so the documents would look like this:

{
    "_id" : ObjectId(...), // could also use a combination of prod_id, terr_id, and time so you get a free unique index to look up by those 3 values
    "prod_id" : "DEADBEEF",
    "terr_id" : "FEEDBEAD",
    "time" : ISODate("2014-10-22T00:00:00.000Z"), // start of the day this document contains the data for
    "data" : [
        {
            "price" : 1234321,
            "time" : ISODate("2014-10-22T15:00:00.000Z") // start of the hour this data point is for
        },
        ...
    ]
}

I like the time interval of 1 day because it hits a nice balance between the number of documents (mostly relevant because of index sizes), the size of documents (the 16MB limit, and documents have to be piped over the network), and the ease of retiring old docs (hold 15 days, wipe + archive everything from the 15th day back at some point each day). If you put an index on { "prod_id" : 1, "terr_id" : 1 }, that should let you fulfill your two main queries efficiently. You can gain an additional performance boost by preallocating the document for each day so that updates are in-place.
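Preallocation could look roughly like this (the prices collection name is just for illustration; the hourly write then $sets into the right slot instead of growing the document):

var day = ISODate("2014-10-22T00:00:00Z");

// Preallocate one slot per hour so the document never grows after creation.
var slots = [];
for (var h = 0; h < 24; h++) {
    slots.push({ "price" : null, "time" : new Date(day.getTime() + h * 3600 * 1000) });
}
db.prices.insert({
    "prod_id" : "DEADBEEF",
    "terr_id" : "FEEDBEAD",
    "time" : day,
    "data" : slots
});

// Hourly update: an in-place $set into the preallocated slot (hour 15 here).
db.prices.update(
    { "prod_id" : "DEADBEEF", "terr_id" : "FEEDBEAD", "time" : day },
    { $set : { "data.15.price" : 1234321 } }
);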

There's a great blog post about managing time series data like this, based on experience building the MMS monitoring system. I've essentially lifted my ideas from there.

Upvotes: 2
