Reputation: 11

Azure Cosmos DB aggregation and indexes

I'm trying to use Cosmos DB and I'm having trouble running a simple count on a collection.

My document schema is shown below, and the collection holds 80,000 documents.

{
    "_id" : ObjectId("5aca8ea670ed86102488d39d"),
    "UserID" : "5ac161d742092040783a4ee1",
    "ReferenceID" : 87396,
    "ReferenceDate" : ISODate("2018-04-08T21:50:30.167Z"),
    "ElapsedTime" : 1694,
    "CreatedDate" : ISODate("2018-04-08T21:50:30.168Z")
}

If I run the command below to count all documents in the collection, I get the result very quickly:

db.Tests.count()

But when I run the same count for a specific user, I get the message "Request rate is large".

db.Tests.find({UserID:"5ac161d742092040783a4ee1"}).count()

In the Cosmos DB documentation I found this scenario, and the suggestion is to increase the RUs. Currently I have 400 RU/s; when I increase to 10,000 RU/s I can run the command with no errors, but it takes 5 seconds.

I already tried to create an index explicitly, but it seems Cosmos DB doesn't use the index for the count.

I do not think it is reasonable to have to pay for 10,000 RU/s for a simple count in a collection with approximately 100,000 documents, especially when it still takes about 5 seconds.

Upvotes: 1

Answers (2)

loomchild

Reputation: 778

I think it just doesn't work.

The index seems to be used when selecting the documents to be counted, but the count itself is then done by reading each document, which consumes a lot of RUs.

This query is cheap and fast:

db.Tests.count({ UserID: { '$eq': '5ac161d742092040783a4ee1' }})

but this one is slow and expensive:

db.Tests.count({ ReferenceID: { '$gt': 10 }})

even though this query is fast:

db.Tests.find({ ReferenceID: { '$gt': 10 }}).sort({ ReferenceID: 1 })

I also found this: https://feedback.azure.com/forums/263030-azure-cosmos-db/suggestions/36142468-make-count-aware-of-indexes. Note the status: "We have started work on this feature. Will update here when this becomes generally available."

Pretty disappointing to be honest, especially since this limitation hasn't been addressed for almost 2 years. Note - I am not an expert in this matter and I'd love to be proven wrong, since I also need this feature.

BTW: I noticed that simple indexes seem to be created automatically for each individual field, so no need to create them manually.

Upvotes: 0

Imre Pühvel

Reputation: 4994

Count by filter queries ARE using indexes if they are available.

If you try a count by filter on a non-indexed field, the query would not time out but fail outright. Try it. You should get an error along the lines of:

{"Errors":["An invalid query has been specified with filters against path(s) excluded from indexing. Consider adding allow scan header in the request."]}

So definitely add a suitable index on UserID.

If you don't have index coverage and don't get the above error, then you have probably set the enableScanInQuery flag. This is almost always a bad idea, because a full scan does not scale: it consumes increasingly large amounts of RUs as your dataset grows. So make sure it is off and index instead.

When you DO have an index on the filtered field, your query should run. You can verify that the index is actually being used by sending the x-ms-documentdb-populatequerymetrics header, which should return confirmation in the indexLookupTimeInMs and indexUtilizationRatio fields. Example output:

"totalExecutionTimeInMs=8.44;queryCompileTimeInMs=8.01;queryLogicalPlanBuildTimeInMs=0.04;queryPhysicalPlanBuildTimeInMs=0.06;queryOptimizationTimeInMs=0.00;VMExecutionTimeInMs=0.14;indexLookupTimeInMs=0.11;documentLoadTimeInMs=0.00;systemFunctionExecuteTimeInMs=0.00;userFunctionExecuteTimeInMs=0.00;retrievedDocumentCount=0;retrievedDocumentSize=0;outputDocumentCount=1;outputDocumentSize=0;writeOutputTimeInMs=0.01;indexUtilizationRatio=0.00"

It also gives you some insight into where the effort has gone if you feel the RU charge is too large.
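
To make a metrics string like the one above easier to read, it can be split into key/value pairs. The following is only a minimal JavaScript sketch: the helper name parseQueryMetrics is hypothetical (it is not part of any Cosmos DB SDK), and the sample string is a shortened copy of the example output above.

```javascript
// Hypothetical helper: turns "key=value;key=value;..." into a plain object.
// Numeric values are converted to numbers; anything else is kept as a string.
function parseQueryMetrics(metrics) {
  const result = {};
  for (const pair of metrics.split(";")) {
    const idx = pair.indexOf("=");
    if (idx === -1) continue; // skip empty or malformed segments
    const key = pair.slice(0, idx).trim();
    const raw = pair.slice(idx + 1).trim();
    const value = Number(raw);
    result[key] = raw === "" || Number.isNaN(value) ? raw : value;
  }
  return result;
}

// Shortened copy of the example output from this answer.
const sample =
  "totalExecutionTimeInMs=8.44;indexLookupTimeInMs=0.11;" +
  "documentLoadTimeInMs=0.00;retrievedDocumentCount=0;outputDocumentCount=1";

const m = parseQueryMetrics(sample);
console.log(m.indexLookupTimeInMs, m.documentLoadTimeInMs, m.retrievedDocumentCount);
```

With the fields separated like this, a non-zero retrievedDocumentCount or a large documentLoadTimeInMs would hint that documents are being loaded rather than the count being answered from the index alone.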

If the index lookup time itself is too high, consider whether your index is selective enough and whether the index settings are suitable. Look at your UserID values and their distribution, and adjust the index accordingly.

Another wild guess: check whether the API you are using defers executing find(..) until it knows that count() is really what you are after. It is unclear which API you are using. If it turns out to be fetching all matching documents to the client side before counting, that would explain the unexpectedly high RU cost, especially if there are many matching documents or the documents are large. Check the API documentation.

I also suggest executing the same query directly in the Azure Portal to compare the RU cost and verify whether the issue is client-related or not.

Upvotes: 3
