leemes
leemes

Reputation: 45675

Getting distinct field values with utilizing index in DocumentDB

I have a DocumentDB database (engine 5.0.0) with a collection my_collection with just two fields

(The real collection has of course more, but I created test collection with exactly just these two fields for the purpose of validating if what I observe is independent of my particular real case.)

Now I want to query for distinct values of user.

I fail to formulate this (in my opinion quite simple) query in a way to make use of an index (assume there are too many documents in total which we definitely don't want to need to scan, but there are not so many distinct users).

Therefore I created a (non-unique) index

{ "user": 1 }

Then, the query could be expressed as (in mongosh syntax)

db.my_collection.distinct("user")

or alternatively I could use an aggregation pipeline like

db.my_collection.aggregate(
    [
        "$group": { "_id": "$user" }
    ]
)

Note that I also tried with an additional "$sort": { "_id": 1 } after the group, but let me furthermore note that I don't necessarily need an ordered result but something in the docs indicated that to be able to use the index it is required to also sort the result, so I tried this out...

Due to the index I thought it could perform the query (all alternatives above to be honest) by even only looking at the index. Unfortunately, the query is always performed with a scan of the documents (or the index?). (This is an assumption since the response time "feels" proportional to the number of documents in the collection which I increased when performing isolated tests. I don't fully understand the explain response to be honest but it shows stage: 'AGGREGATE' for the distinct, and stage: 'HASH_AGGREGATE' for the aggregate query.)

I was then trying to force DocumentDB to use the index by adding hint: { "user": 1 } to the query, however, that is only supported by aggregate and not by distinct. When using with aggregate, I get the error message:

MongoServerError: Cannot use Hint for this Query. Index is multi key index , partial index or sparse index and query is not optimized to use this index.

But that does not make sense for me, because the index I use is neither multi key, partial nor sparse, at least I didn't specify any of these when creating it (but again, I don't fully understand it although I tried to read the documentation...)

So now I wonder


EDIT

Additional information requested in the comments:

db.my_collection.explain("executionStats").distinct("user"):

{
  queryPlanner: {
    plannerVersion: 1,
    namespace: 'testdb.my_collection',
    winningPlan: {
      stage: 'AGGREGATE',
      inputStage: {
        stage: 'HASH_AGGREGATE',
        inputStage: {
          stage: 'PROJECTION',
          inputStage: {
            stage: 'IXONLYSCAN',
            indexName: 'user_1',
            direction: 'forward'
          }
        }
      }
    }
  },
  executionStats: {
    executionSuccess: true,
    executionTimeMillis: '219.245',
    planningTimeMillis: '0.274',
    executionStages: {
      stage: 'AGGREGATE',
      nReturned: '1',
      executionTimeMillisEstimate: '217.264',
      inputStage: {
        stage: 'HASH_AGGREGATE',
        nReturned: '4',
        executionTimeMillisEstimate: '217.248',
        inputStage: {
          stage: 'PROJECTION',
          nReturned: '50000',
          executionTimeMillisEstimate: '155.232',
          inputStage: {
            stage: 'IXONLYSCAN',
            nReturned: '50000',
            executionTimeMillisEstimate: '56.970',
            indexName: 'user_1',
            direction: 'forward'
          }
        }
      }
    }
  },
  serverInfo: {
    host: '...-cluster-instance-1',
    port: 27017,
    version: '5.0.0'
  },
  ok: 1,
  operationTime: Timestamp({ t: 1722928392, i: 1 })
}

(Note: the collection has 50k documents and 4 distinct values in user.)

And while we're at it, here is

db.my_collection.explain("executionStats").aggregate([{"$group":{"_id":"$user"}}]):

{
  queryPlanner: {
    plannerVersion: 1,
    namespace: 'testdb.test_distinct',
    winningPlan: {
      stage: 'HASH_AGGREGATE',
      inputStage: {
        stage: 'COLLSCAN'
      }
    }
  },
  executionStats: {
    executionSuccess: true,
    executionTimeMillis: '112.782',
    planningTimeMillis: '0.087',
    executionStages: {
      stage: 'HASH_AGGREGATE',
      nReturned: '4',
      executionTimeMillisEstimate: '112.654',
      inputStage: {
        stage: 'COLLSCAN',
        nReturned: '50000',
        executionTimeMillisEstimate: '67.115'
      }
    }
  },
  serverInfo: {
    host: '...-cluster-instance-1',
    port: 27017,
    version: '5.0.0'
  },
  ok: 1,
  operationTime: Timestamp({ t: 1722928895, i: 1 })
}

Upvotes: 4

Views: 174

Answers (0)

Related Questions