Dan

Reputation: 35

How do I fix a query problem on Azure Cosmos DB that occurs only on collections with large amounts of data?

I'm trying to read from a Cosmos DB collection (MachineCollection) with a large amount of data (58 GB of data; index size 9 GB). Throughput is set to 1000 RU/s. The collection is partitioned by serial number, with read locations (WestEurope, NorthEurope) and write location (WestEurope). In parallel with my read attempts, MachineCollection is fed with new data every 20 seconds.

The problem is that I cannot query any data via Python. If I execute the same query in the Cosmos DB Data Explorer (e.g. querying for a certain serial number), I get results in no time.

For troubleshooting purposes, I have created a new database (TestDB) and a TestCollection. The TestCollection holds 10 datasets from MachineCollection. If I try to read from this TestCollection via Python, it succeeds and I am able to save the data to CSV.

This makes me wonder why I am not able to query data from MachineCollection when TestDB and TestCollection are configured with exactly the same properties.

What I have already tried for querying via Python:

options['enableCrossPartitionQuery'] = True
Querying using the partition key: options['partitionKey'] = 'certainSerialnumber'

Same as always. Works with TestCollection, but not with MachineCollection.
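
For reference, the query call itself looks roughly like this. This is a minimal sketch using the legacy pydocumentdb SDK; the endpoint, key, collection link, and the SerialNumber field in the query are placeholders, not my real values:

import pydocumentdb.document_client as document_client

# Placeholder connection details
client = document_client.DocumentClient('https://<account>.documents.azure.com:443/',
                                        {'masterKey': '<primary-key>'})
collection_link = 'dbs/<database>/colls/MachineCollection'

options = {}
options['enableCrossPartitionQuery'] = True
options['partitionKey'] = 'certainSerialnumber'

query = {'query': "SELECT * FROM c WHERE c.SerialNumber = 'certainSerialnumber'"}

# Hangs on MachineCollection, succeeds on TestCollection.
results = list(client.QueryDocuments(collection_link, query, options))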

Any ideas on how to resolve this issue are highly appreciated!

Upvotes: 1

Views: 620

Answers (1)

Jay Gong

Reputation: 23782

First, you need to know that Document DB imposes limits on response page size. This link summarizes some of those limits: Azure DocumentDb Storage Limits - what exactly do they mean?

Secondly, if you want to query large amounts of data from Document DB, you have to consider query performance; please refer to this article: Tuning query performance with Azure Cosmos DB.

By looking at the Document DB REST API, you can observe several important parameters which have a significant impact on query operations: x-ms-max-item-count and x-ms-continuation.
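
To make those two headers concrete, here is a rough sketch of a raw REST query. The account, key, and collection link are placeholders, and the signature follows the documented master-key authorization scheme:

import base64, hashlib, hmac, json, urllib.parse
from email.utils import formatdate
import requests

HOST = 'https://<account>.documents.azure.com'            # placeholder
MASTER_KEY = '<primary-key>'                              # placeholder
collection_link = 'dbs/MachineDB/colls/MachineCollection' # placeholder

def master_key_auth(verb, resource_type, resource_link, date, key):
    # Documented master-key scheme: HMAC-SHA256 over verb, resource type,
    # resource link, and date (verb, type, and date lower-cased).
    payload = f'{verb.lower()}\n{resource_type.lower()}\n{resource_link}\n{date.lower()}\n\n'
    sig = base64.b64encode(hmac.new(base64.b64decode(key),
                                    payload.encode('utf-8'),
                                    hashlib.sha256).digest()).decode()
    return urllib.parse.quote(f'type=master&ver=1.0&sig={sig}')

date = formatdate(usegmt=True)
headers = {
    'Authorization': master_key_auth('POST', 'docs', collection_link, date, MASTER_KEY),
    'x-ms-date': date,
    'x-ms-version': '2017-02-22',
    'Content-Type': 'application/query+json',
    'x-ms-documentdb-isquery': 'True',
    'x-ms-documentdb-query-enablecrosspartition': 'True',
    'x-ms-max-item-count': '10',  # page size
}
body = json.dumps({'query': 'SELECT * FROM c', 'parameters': []})
resp = requests.post(f'{HOST}/{collection_link}/docs', data=body, headers=headers)

# The response carries at most 10 documents; sending the returned
# x-ms-continuation header back fetches the next page.
token = resp.headers.get('x-ms-continuation')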

As far as I know, the Azure portal doesn't automatically help you optimize your SQL, so you need to handle paging yourself in the SDK or REST API.

You could set the value of Max Item Count and paginate your data using the continuation token. The Document DB SDK supports reading paginated data seamlessly. You could refer to the snippet of Python code below:

# client is a pydocumentdb DocumentClient; ask for pages of 10 documents.
q = client.QueryDocuments(collection_link, query, {'maxItemCount': 10})
# _fetch_function is a private member of the QueryIterable; it returns a
# (documents, response_headers) tuple for one page.
results_1, headers_1 = q._fetch_function({'maxItemCount': 10})
# The continuation token is a string representing a JSON object.
token = headers_1['x-ms-continuation']
# Pass the token back to fetch the next page of up to 10 documents.
results_2, headers_2 = q._fetch_function({'maxItemCount': 10, 'continuation': token})
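
Alternatively, if you'd rather avoid the private _fetch_function member, the QueryIterable returned by QueryDocuments also pages transparently when you simply iterate it. A sketch, reusing collection_link and query from above:

# Iterating drains every page of 10 documents, following continuation
# tokens internally until the query is exhausted.
docs = []
for doc in client.QueryDocuments(collection_link, query, {'maxItemCount': 10}):
    docs.append(doc)

The trade-off is that you never see the raw continuation token, so you can't hand paging control back to a caller.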

Another case you could refer to: How do I set continuation tokens for Cosmos DB queries sent by document_client objects in Python?

Upvotes: 2
