Reputation: 3371
I found a lot of great information on the internet about MongoDB schemas for steady time-series data. But in all of these examples, new data objects are inserted at a steady rate (i.e. a single update every second).
Is there a recommended schema for the case of non-steady document updates? An example would be pinging an external API for new data. During the first few pings there might not be any new data available, so there is nothing to update in Mongo. But a few pings later (which could be seconds, or potentially minutes), the API has published 5 new objects, so now Mongo needs to update 5 different fields.
So we're dependent on time for receiving data, but the relationship isn't 1:1 (the data flow is not constant). This means pre-allocating and filling a 60 minute x 60 second nested object...
values: {
0: { 0: {obj}, 1: {obj}, …, 59: {obj} },
1: { 0: {obj}, 1: {obj}, …, 59: {obj} },
…,
58: { 0: {obj}, 1: {obj}, …, 59: {obj} },
59: { 0: {obj}, 1: {obj}, …, 59: {obj} }
}
... doesn't make much sense, because we can't guarantee we can fill it up within 1 hour. We simply cannot predict how many new objects will be published within a given period of time.
Rather than creating the nested grid strictly based on units of time, would a better approach for this use-case be to create an arbitrarily-sized grid and simply fill it sequentially as new objects arrive, starting a new document once the grid is full?
For example:
values: {
0: { 0: {obj}, 1: {obj}, …, 199: {obj} },
1: { 0: {obj}, 1: {obj}, …, 199: {obj} },
…,
198: { 0: {obj}, 1: {obj}, …, 199: {obj} },
199: { 0: {obj}, 1: {obj}, …, 199: {obj} }
}
... where 200x200 was chosen randomly because, I don't know, capping at 40,000 objects/document seemed like a nice round number? Depending on how much data is churned out by the external API, this document could get filled in one day, or two days, or maybe up to a week if there isn't much action.
If this is the correct approach, is there a recommended and/or maximum grid size to consider? The smaller the grid, the more documents will be generated and need to be tracked. The larger the grid, the fewer documents floating around in the collection, but updates might take a hair longer.
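To make this concrete, here's a rough sketch of the sequential fill in Python with pymongo (the database/collection names, the count field, and the insert_trade helper are all invented for illustration, not a definitive implementation):

from pymongo import MongoClient

GRID = 200                        # 200x200 = 40,000 objects per document
client = MongoClient()            # assumes a local mongod
trades = client["btc"]["trades"]  # hypothetical db/collection names

def insert_trade(trade):
    # Atomically claim the next free slot in the newest grid document.
    doc = trades.find_one_and_update(
        {"count": {"$lt": GRID * GRID}},   # only documents with room left
        {"$inc": {"count": 1}},
        sort=[("_id", -1)],                # newest document first
    )
    if doc is None:
        # Every existing grid is full: start a new document.
        doc_id = trades.insert_one({"count": 1, "values": {}}).inserted_id
        slot = 0
    else:
        doc_id = doc["_id"]
        slot = doc["count"]                # value *before* the increment
    row, col = divmod(slot, GRID)
    trades.update_one(
        {"_id": doc_id},
        {"$set": {f"values.{row}.{col}": trade}},
    )

This glosses over concurrency (two writers could race to create a new grid document at the same time), but it shows the basic idea: the grid position is derived from an arrival counter rather than from the clock.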
Any answer is likely to be based on some assumptions, so for the purposes of discussion, let's assume we're interested in this API endpoint. (It publishes bitcoin trades in real time as they occur on the BTCChina online exchange.) We can assume that:
Each object has the following form and size:
{ "date":"1425556988", // date of trade "price":1683.98, // price per BTC "amount":0.0134, // amount of BTC traded "tid":"24357098" // transaction ID }
Clients will NOT be subscribing to these documents; client-side resolutions of the data are generated from this raw information.
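For context, the polling loop feeding this would look roughly like the following sketch (the URL is a placeholder for the endpoint linked above, and the since parameter and the reuse of insert_trade are assumptions for illustration):

import time
import requests  # third-party HTTP client

API_URL = "https://example.com/trades"  # placeholder for the endpoint linked above

def poll_forever(last_tid=0):
    # Each poll may return zero or many new trades - the flow is not 1:1 with time.
    while True:
        resp = requests.get(API_URL, params={"since": last_tid})  # assumed parameter
        for trade in resp.json():        # possibly an empty list
            insert_trade(trade)          # e.g. the grid-fill sketch above
            last_tid = max(last_tid, int(trade["tid"]))
        time.sleep(1)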
Any advice would be much appreciated! Thanks!
Upvotes: 3
Views: 304
Reputation: 4948
Remember, in a document-oriented store you need a timestamp on each doc. That means creating arbitrary bins would be pretty pointless because you can't say "this doc refers to this hour/minute."
You've got 3 options:
1. Keep the hour > minute > second structure and simply don't create an object when there's no data available. That's the beauty of a schemaless design: if no data comes in at 12:59:32, that object will simply be missing. Think of this as a sparse matrix (see the sketch after this list).
2. Keep the hour > minute > second structure and preallocate a 60x60 object for each doc, filled with NaNs. The benefit is that the document won't have to move in memory. (Popular choice.)
3. Store each timestamp as its own doc. The benefit here is maximum resolution, since the buy/sell price can change up to 1,000 times in a single second.
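To illustrate the first two options, here is a rough Python/pymongo sketch (the database/collection names and the per-hour _id scheme are assumptions, not the one true layout):

import math
from datetime import datetime, timezone
from pymongo import MongoClient

coll = MongoClient()["btc"]["ticks"]  # hypothetical db/collection names

def record_trade(trade):
    # Option 1: sparse writes - only cells where data actually arrived exist.
    ts = datetime.fromtimestamp(int(trade["date"]), tz=timezone.utc)
    hour_id = ts.strftime("%Y-%m-%dT%H")  # one document per hour
    coll.update_one(
        {"_id": hour_id},
        {"$set": {f"values.{ts.minute}.{ts.second}": trade}},
        upsert=True,
    )

def preallocate_hour(hour_id):
    # Option 2: pre-create the full 60x60 grid of NaNs up front, so later
    # $set calls rewrite cells in place instead of growing the document.
    grid = {str(m): {str(s): math.nan for s in range(60)} for m in range(60)}
    coll.update_one({"_id": hour_id}, {"$setOnInsert": {"values": grid}}, upsert=True)

Note that both of these key by second, so several trades landing in the same second would overwrite one another; that resolution limit is exactly what option 3 avoids.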
Upvotes: 1