xburgos

Reputation: 401

Mongodb query by id using $in operator for large sets of Ids is too slow, alternatives?

I'm working on a project that uses Solr for full text search and Mongodb as persistent storage. Basically searches in Solr return Mongo ids that we then use to fetch the documents.

The problem is that some Solr searches return results on the order of thousands of ids. These results are what we expect, so there's no issue with Solr here. The problem comes when we want to fetch, say, 10k ids from MongoDB. The query uses $in but takes way too long; after checking the MongoDB profiler, it seems Mongo spends a lot of time waiting to acquire read locks.

Are there any alternative approaches? Maybe still using $in but splitting the set of ids into smaller chunks?
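
In case it helps frame the question, this is roughly what I mean by chunking (a sketch; the partitioning helper and the class name are made up, and the Spring Data call in the comment is just illustrative, not our actual code):

```java
import java.util.ArrayList;
import java.util.List;

public class IdBatcher {

    // Split a large list of ids into fixed-size chunks so that each
    // $in clause stays small.
    public static <T> List<List<T>> partition(List<T> ids, int size) {
        List<List<T>> chunks = new ArrayList<>();
        for (int i = 0; i < ids.size(); i += size) {
            chunks.add(new ArrayList<>(ids.subList(i, Math.min(i + size, ids.size()))));
        }
        return chunks;
    }

    public static void main(String[] args) {
        List<Integer> ids = new ArrayList<>();
        for (int i = 0; i < 10; i++) ids.add(i);

        // Each chunk would then be issued as its own query, e.g. with
        // Spring Data Mongo (hypothetical names):
        //   mongoTemplate.find(query(where("_id").in(chunk)), Doc.class)
        // and the partial results merged afterwards.
        List<List<Integer>> chunks = partition(ids, 4);
        System.out.println(chunks.size());   // 3 chunks: [0..3], [4..7], [8..9]
    }
}
```

The open question is whether many small $in queries actually beat one big one, given the lock contention we're seeing.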

As a side note, we're using Java 8 with Spring 4.0 and Spring Data Mongo 1.6.

Also, as additional information, the collection has 1.3 million documents, with each document averaging 11 KB in size.

Here's an example of the query:

  {"_id" : {
        "$in" : [
            ObjectId("5441614a5d28a9872823694c"),
            ObjectId("544155eb5d28a987281aa112"),
            ObjectId("5441500e5d28a9872815b917"),
            ObjectId("544153285d28a987281877b9"),
            ObjectId("544159095d28a987281c1f5c"),
            ObjectId("54415b105d28a987281d3ad7"),
            ObjectId("54415a995d28a987281cf0e6"),
            ObjectId("544160215d28a9872822383b"),
            ObjectId("544160e85d28a98728230342"),
            ObjectId("544157ba5d28a987281b7dea"),
            ObjectId("54415e375d28a9872820508b"),
            ObjectId("544150f75d28a98728169563"),
            ObjectId("54415c6b5d28a987281e8bcb"),
            ObjectId("54415a6d5d28a987281cd704").............]}}

And this is the result of explain for a small set:

{
"cursor" : "BtreeCursor _id_ multi",
"isMultiKey" : false,
"n" : 14,
"nscannedObjects" : 14,
"nscanned" : 27,
"nscannedObjectsAllPlans" : 14,
"nscannedAllPlans" : 27,
"scanAndOrder" : false,
"indexOnly" : false,
"nYields" : 0,
"nChunkSkips" : 0,
"millis" : 0,
"indexBounds" : {
    "_id" : [
        [
            ObjectId("5441500e5d28a9872815b917"),
            ObjectId("5441500e5d28a9872815b917")
        ],
        [
            ObjectId("544150f75d28a98728169563"),
            ObjectId("544150f75d28a98728169563")
        ],
        [
            ObjectId("544153285d28a987281877b9"),
            ObjectId("544153285d28a987281877b9")
        ],
        [
            ObjectId("544155eb5d28a987281aa112"),
            ObjectId("544155eb5d28a987281aa112")
        ],
        [
            ObjectId("544157ba5d28a987281b7dea"),
            ObjectId("544157ba5d28a987281b7dea")
        ],
        [
            ObjectId("544159095d28a987281c1f5c"),
            ObjectId("544159095d28a987281c1f5c")
        ],
        [
            ObjectId("54415a6d5d28a987281cd704"),
            ObjectId("54415a6d5d28a987281cd704")
        ],
        [
            ObjectId("54415a995d28a987281cf0e6"),
            ObjectId("54415a995d28a987281cf0e6")
        ],
        [
            ObjectId("54415b105d28a987281d3ad7"),
            ObjectId("54415b105d28a987281d3ad7")
        ],
        [
            ObjectId("54415c6b5d28a987281e8bcb"),
            ObjectId("54415c6b5d28a987281e8bcb")
        ],
        [
            ObjectId("54415e375d28a9872820508b"),
            ObjectId("54415e375d28a9872820508b")
        ],
        [
            ObjectId("544160215d28a9872822383b"),
            ObjectId("544160215d28a9872822383b")
        ],
        [
            ObjectId("544160e85d28a98728230342"),
            ObjectId("544160e85d28a98728230342")
        ],
        [
            ObjectId("5441614a5d28a9872823694c"),
            ObjectId("5441614a5d28a9872823694c")
        ]
    ]
},
"server" : "0001a22df018:27017"

}

Upvotes: 8

Views: 5934

Answers (1)

Wizard

Reputation: 4431

Perhaps this information can help, just for reference.

The size of the collection is roughly 1.3M × 11 KB ≈ 14.6 GB (not a small one)
The fraction of documents you want to fetch is 10K / 1.3M ≈ 0.77%

The documents are indexed, so finding any single one should be very fast, but the collection is large. Since you didn't provide information about the ids, I'll assume the documents they point to are distributed more or less arbitrarily.
MongoDB first tries to find the documents in memory. When it can't find any more, it loads new data from disk for the remaining ids and repeats the search until the work is done. The number of disk loads may be the main factor determining query performance, and it depends on the distribution of your ids: if the matching documents are densely clustered, the query should be fast; otherwise it may be slow. So the speed depends on the distribution of the documents you are fetching.
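
One cheap thing to try, with no guarantee it helps since it depends on how documents are laid out on disk: sort the ids before building the $in, so the _id index (and the pages behind it) is walked in order rather than randomly. A minimal sketch (class and method names are mine):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class SortedIds {

    // Return a sorted copy of the id list. Hex ObjectId strings sort
    // lexicographically in roughly creation order (the leading bytes
    // encode a timestamp), so a sorted batch scans the _id index more
    // sequentially.
    public static List<String> sorted(List<String> ids) {
        List<String> copy = new ArrayList<>(ids);
        Collections.sort(copy);
        return copy;
    }

    public static void main(String[] args) {
        List<String> ids = new ArrayList<>();
        ids.add("544159095d28a987281c1f5c");
        ids.add("5441500e5d28a9872815b917");
        ids.add("544155eb5d28a987281aa112");
        System.out.println(sorted(ids).get(0)); // 5441500e5d28a9872815b917
    }
}
```

Note the explain output in the question already shows the index bounds in sorted order, so the gain, if any, would come from issuing the ids in that order on the client side.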

Using a sharded collection (more shard instances) may also help.

Upvotes: 1

Related Questions