libbkmz
libbkmz

Reputation: 681

MongoDB not using indexes for distinct with simple query

I've had a weired behaviour with mongodb distinct query. currently, i'm using 2.6.10 version. Ok, let's create simple collection for test with explanation.

from pymongo import MongoClient
import random

client = MongoClient('127.0.0.1', 27017)
client.DBTEST.random.remove({})

value = 0
BATCH_LEN = 16384
BATCH = []

for i in xrange(0, 500000):
    BATCH.append({
            "product": "value_uniq_1",
            "number": value
        })

    if random.randint(0, 100) <= 1:
        value = i

    if len(BATCH) > BATCH_LEN:
        client.DBTEST.random.insert(BATCH)
        BATCH = []

client.DBTEST.random.insert(BATCH)
BATCH = []

Ok, it will create collection chich contains documents like this

╔══════════════╦════════╗
║   product    ║ number ║
╠══════════════╬════════╣
║ value_uniq_1 ║ 1      ║
║ value_uniq_1 ║ 1      ║
║ value_uniq_1 ║ 1      ║
║ value_uniq_1 ║ 56     ║
║ value_uniq_1 ║ 56     ║
║ value_uniq_1 ║ 56     ║
║ ...          ║ ...    ║
║ value_uniq_1 ║ 150054 ║
║ value_uniq_1 ║ 150054 ║
║ value_uniq_1 ║ 150054 ║
╚══════════════╩════════╝

Now, I have only 1 unique value for product, but, in near future(1 week) it will increase to near 30 different string values, like this:

╔══════════════╦════════╗
║   product    ║ number ║
╠══════════════╬════════╣
║ value_uniq_1 ║ 1      ║
║ value_uniq_1 ║ 1      ║
║ value_uniq_1 ║ 1      ║
║ value_uniq_1 ║ 56     ║
║ value_uniq_1 ║ 56     ║
║ value_uniq_1 ║ 56     ║
║ ...          ║ ...    ║
║ value_uniq_1 ║ 150054 ║
║ value_uniq_1 ║ 150054 ║
║ value_uniq_1 ║ 150054 ║
║ value_uniq_2 ║ 987    ║
║ value_uniq_2 ║ 987    ║
║ value_uniq_2 ║ 987    ║
╚══════════════╩════════╝

Ok, i finished with my data structure, now lets look a little bit of mongodb queries.

My main goal, is to get all unique values of number for certain product.

I do it in this way:

db.random.distinct("number", {product: "value_uniq_1"})

Ok, this is not quite verbose for debugging, and i will use db.runCommand in the next lines. But, now, lets avoid using query for distinct and look of stats section:

db.runCommand({distinct: 'random', key:'number'})

"stats" : {
    "n" : 500000,
    "nscanned" : 500000,
    "nscannedObjects" : 500000,
    "timems" : 479,
    "cursor" : "BasicCursor"
},

That's okay, because we didn't create indexes yet, lets add for number field:

db.random.createIndex({number: 1})

ReRun previous query:

db.runCommand({distinct: 'random', key:'number'})

"stats" : {
    "n" : 10005,
    "nscanned" : 10005,
    "nscannedObjects" : 0,
    "timems" : 83,
    "cursor" : "DistinctCursor"
},

Great, it uses indexes, and everything working fine! 0 nscannedObjects!!!

Okay, lets add query for distinct:

db.runCommand({distinct: 'random', key:'number', query: {product: "value_uniq_1"}})

"stats" : {
    "n" : 500000,
    "nscanned" : 500000,
    "nscannedObjects" : 500000,
    "timems" : 694,
    "cursor" : "BasicCursor"
},

It's not what we expected("nscannedObjects" : 500000), BUT, there is no index for product, lets create some one:

db.random.createIndex({product: 1, number: -1})

There is no difference with direction, any combination of product: 1, number -1 OR product -1, number 1, OR product: 1, number: 1 gives the same behaviour. I've checked all combinations.

db.runCommand({distinct: 'random', key:'number', query: {product: "value_uniq_1"}})

"stats" : {
    "n" : 500000,
    "nscanned" : 500000,
    "nscannedObjects" : 500000,
    "timems" : 968,
    "cursor" : "BtreeCursor product_1_number_-1"
},

WTF is going on? Why it's scanning all collection with an index? Currently, whole collection contains with only ONE product value, and I can't guess what will be with different products. Why it's so slow with common distinct query? 1 sec it's too slow...

I don't want to use separate collection for each product it's crazy and unefficient, because I need shared queries between all products. My real DB contains more than 5millions of numbers per one product and it takes more that 3 seconds for this query.

Upvotes: 4

Views: 1124

Answers (1)

Sachin Shukla
Sachin Shukla

Reputation: 171

I am using 3.0.2, Looks, its utilizing the index but still don't know why its scanning all the records, I created the same collection in my mongodb, and created the Indexes. querying the distinct values of "number" field shows that it scanned 20K recores (which are the total number of records I inserted)

Please refer this image where it shows the Index scan in the plan summary.

https://www.dropbox.com/s/dh3tglyg4lsaqmm/distinct_explain_plan.png?dl=0

> db.random.getIndexes()
[
    {
            "v" : 1,
            "key" : {
                    "_id" : 1
            },
            "name" : "_id_",
            "ns" : "test.random"
    },
    {
            "v" : 1,
            "key" : {
                    "product" : 1,
                    "number" : 1
            },
            "name" : "product_1_number_1",
            "ns" : "test.random"
    },
    {
            "v" : 1,
            "key" : {
                    "number" : 1
            },
            "name" : "number_1",
            "ns" : "test.random"
    }
]

Upvotes: 1

Related Questions