Best shard key (or optimised query) for range query on sub-document array

Question

Below is a simplified version of a document in my database:

{
    _id : 1,
    main_data : 100,
    sub_docs: [
        {
            _id : a,
            data : 100
        },
        {
            _id: b,
            data : 200
        },
        {
            _id: c,
            data: 150
        }
    ]
}

So imagine I have lots of these documents with varied data values (say 0 - 1000). Currently my query is something like:

db.myDb.find(
    { sub_docs.data : { $elemMatch: { $gte: 110, $lt: 160 } } }
)

Is there any shard key I could use to help this query? As currently it is querying all shards. If not is there a better way to structure my query?

BigDataKid · Accepted Answer

Jackson,

You are thinking about this problem the right way. The problem with broadcast queries in MongoDB is that they can't scale.

Any MongoDB query that does not filter on the shard key, will be broadcast to all shards. Also, range queries are likely to either cause broadcasts of at the very least cause your queries to be sent to multiple shards.

So here is some things to think about

Query Frequency -- Is the range query your most frequent query? What is the expected workload?
Range Logic -- Is there any instrinsic logic to how you are going to apply the ranges? Let's say, you would say 0-200 is small, 200 - 400 is medium. You could potentially add another field to your document and shard on it.
Additional shard key candidates -- Sometimes there are other fields that can be included in all or most of your queries and it would provide good distribution. By combining filtering with your range queries you could restrict your query to one or fewer shards.
Break array -- You could potentially have multiple documents instead of an array. In this scenario, you would have multiple docs, one for each occurrence of the array and main data would be duplicated across mulitple documents. Range query on this item would still be a problem, but you could involve multiple shards, not necessarily all (it depends on your data demographics and query patterns)

It boils down to the nature of your data and queries. The sample document that you provided is very anonymized so it is harder to know what would be good shard key candidates in your domain.

One last piece of advice is to be careful on your insert/update query patterns if you plan to update your document frequently to add more entries to the array. Growing documents present scaling problems for MongoDB. See this article on this topic.

Best shard key (or optimised query) for range query on sub-document array

Answers (1)

Related Questions