Getting distinct field values with utilizing index in DocumentDB

Question

I have a DocumentDB database (engine 5.0.0) with a collection my_collection with just two fields

_id (UUID)
user (string)

_{(The real collection has of course more, but I created test collection with exactly just these two fields for the purpose of validating if what I observe is independent of my particular real case.)}

Now I want to query for distinct values of user.

I fail to formulate this (in my opinion quite simple) query in a way to make use of an index (assume there are too many documents in total which we definitely don't want to need to scan, but there are not so many distinct users).

Therefore I created a (non-unique) index

{ "user": 1 }

Then, the query could be expressed as (in mongosh syntax)

db.my_collection.distinct("user")

or alternatively I could use an aggregation pipeline like

db.my_collection.aggregate(
    [
        "$group": { "_id": "$user" }
    ]
)

Note that I also tried with an additional "$sort": { "_id": 1 } after the group, but let me furthermore note that I don't necessarily need an ordered result but something in the docs indicated that to be able to use the index it is required to also sort the result, so I tried this out...

Due to the index I thought it could perform the query (all alternatives above to be honest) by even only looking at the index. Unfortunately, the query is always performed with a scan of the documents (or the index?). (This is an assumption since the response time "feels" proportional to the number of documents in the collection which I increased when performing isolated tests. I don't fully understand the explain response to be honest but it shows stage: 'AGGREGATE' for the distinct, and stage: 'HASH_AGGREGATE' for the aggregate query.)

I was then trying to force DocumentDB to use the index by adding hint: { "user": 1 } to the query, however, that is only supported by aggregate and not by distinct. When using with aggregate, I get the error message:

MongoServerError: Cannot use Hint for this Query. Index is multi key index , partial index or sparse index and query is not optimized to use this index.

But that does not make sense for me, because the index I use is neither multi key, partial nor sparse, at least I didn't specify any of these when creating it (but again, I don't fully understand it although I tried to read the documentation...)

So now I wonder

Why is the index not utilized for "immediately" finding the user values? (Due to the error message when specifying it in hint, I assume it just can't be used, leading to the next question...)
Why can't I use the index? (Maybe DocumentDB does not support it here? But the documentation does not indicate anything like that.)
Is there an alternative approach I should try?

EDIT

Additional information requested in the comments:

db.my_collection.explain("executionStats").distinct("user"):

{
  queryPlanner: {
    plannerVersion: 1,
    namespace: 'testdb.my_collection',
    winningPlan: {
      stage: 'AGGREGATE',
      inputStage: {
        stage: 'HASH_AGGREGATE',
        inputStage: {
          stage: 'PROJECTION',
          inputStage: {
            stage: 'IXONLYSCAN',
            indexName: 'user_1',
            direction: 'forward'
          }
        }
      }
    }
  },
  executionStats: {
    executionSuccess: true,
    executionTimeMillis: '219.245',
    planningTimeMillis: '0.274',
    executionStages: {
      stage: 'AGGREGATE',
      nReturned: '1',
      executionTimeMillisEstimate: '217.264',
      inputStage: {
        stage: 'HASH_AGGREGATE',
        nReturned: '4',
        executionTimeMillisEstimate: '217.248',
        inputStage: {
          stage: 'PROJECTION',
          nReturned: '50000',
          executionTimeMillisEstimate: '155.232',
          inputStage: {
            stage: 'IXONLYSCAN',
            nReturned: '50000',
            executionTimeMillisEstimate: '56.970',
            indexName: 'user_1',
            direction: 'forward'
          }
        }
      }
    }
  },
  serverInfo: {
    host: '...-cluster-instance-1',
    port: 27017,
    version: '5.0.0'
  },
  ok: 1,
  operationTime: Timestamp({ t: 1722928392, i: 1 })
}

(Note: the collection has 50k documents and 4 distinct values in user.)

And while we're at it, here is

db.my_collection.explain("executionStats").aggregate([{"$group":{"_id":"$user"}}]):

{
  queryPlanner: {
    plannerVersion: 1,
    namespace: 'testdb.test_distinct',
    winningPlan: {
      stage: 'HASH_AGGREGATE',
      inputStage: {
        stage: 'COLLSCAN'
      }
    }
  },
  executionStats: {
    executionSuccess: true,
    executionTimeMillis: '112.782',
    planningTimeMillis: '0.087',
    executionStages: {
      stage: 'HASH_AGGREGATE',
      nReturned: '4',
      executionTimeMillisEstimate: '112.654',
      inputStage: {
        stage: 'COLLSCAN',
        nReturned: '50000',
        executionTimeMillisEstimate: '67.115'
      }
    }
  },
  serverInfo: {
    host: '...-cluster-instance-1',
    port: 27017,
    version: '5.0.0'
  },
  ok: 1,
  operationTime: Timestamp({ t: 1722928895, i: 1 })
}

Getting distinct field values with utilizing index in DocumentDB

EDIT

Answers (0)

Related Questions