lahsrah

Reputation: 9173

What's the fastest way to download/select *all* documents from an Azure Cosmos DB collection?

I have a collection of 10 million+ records and need to load them all into memory as quickly as possible. Is there a way to do this faster than using the query iterator?

This is my current code. I have 40,000 RU/s allocated, but it doesn't seem to use all of the available RUs when I run it.

using var queryIterator = container.GetItemQueryIterator<Customer>("select * from c");

var records = new Dictionary<string, Customer>();

while (queryIterator.HasMoreResults)
{
    var response = await queryIterator.ReadNextAsync().ConfigureAwait(false);
    foreach (var customer in response.Resource)
    {
        records.Add(customer.Id, customer);
    }
}

The above code works, but it takes a long time, and my RU usage sits at less than 10% of the allocated throughput.

The partition key of my documents is id, so every document is in its own logical partition. Could this be the reason for the slow read performance? Is there a way to speed it up?

I tried increasing the page size with QueryRequestOptions

var options = new QueryRequestOptions { MaxItemCount = 10000 };

But it didn't make much of a difference. Even if I set it higher, it seems to max out at around 5,000 documents per page.

Should I be running multiple queries in parallel since I have the RU headroom?

Upvotes: 1

Views: 882

Answers (1)

Matias Quaranta

Reputation: 15603

There isn't an exact answer because it really depends on your application and resources.

You can start by fine-tuning the concurrency, setting MaxConcurrency to either a calculated value or -1. Reference: https://learn.microsoft.com/azure/cosmos-db/nosql/performance-tips-query-sdk?tabs=v3&pivots=programming-language-csharp#tune-the-degree-of-parallelism
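
A minimal sketch of that (reusing the container and Customer type from your own code): MaxConcurrency is passed through QueryRequestOptions when creating the iterator.

var options = new QueryRequestOptions
{
    // -1 lets the SDK decide how many physical partitions to query in parallel
    MaxConcurrency = -1
};

using var queryIterator = container.GetItemQueryIterator<Customer>(
    new QueryDefinition("select * from c"),
    requestOptions: options);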

You are already increasing the page size; you can also fine-tune MaxBufferedItemCount. Reference: https://learn.microsoft.com/en-us/azure/cosmos-db/nosql/performance-tips-query-sdk?tabs=v3&pivots=programming-language-csharp#tune-the-buffer-size
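
Combining the knobs above into one options object could look roughly like this (a sketch; -1 means "let the SDK/service decide" and concrete values should be tuned against your workload):

var options = new QueryRequestOptions
{
    MaxItemCount = -1,          // page size: let the service decide
    MaxConcurrency = -1,        // cross-partition parallelism: let the SDK decide
    MaxBufferedItemCount = -1   // prefetch buffer: let the SDK decide
};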

Also keep in mind that it depends on what you do with the responses. In your case you are just putting them in a Dictionary, but if you put them into a construct that can start consuming them right away (like a pub/sub system), the process that consumes the results can start working while you fetch more pages.
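
A minimal sketch of that idea, using System.Threading.Channels as the hand-off (an assumption; any producer/consumer mechanism works, and ProcessAsync is a hypothetical consumer, not part of the SDK):

using System.Threading.Channels;

var channel = Channel.CreateUnbounded<Customer>();

// Producer: page through the query and push items as they arrive.
var producer = Task.Run(async () =>
{
    using var iterator = container.GetItemQueryIterator<Customer>("select * from c");
    while (iterator.HasMoreResults)
    {
        var response = await iterator.ReadNextAsync().ConfigureAwait(false);
        foreach (var customer in response.Resource)
        {
            await channel.Writer.WriteAsync(customer).ConfigureAwait(false);
        }
    }
    channel.Writer.Complete();
});

// Consumer: starts working on documents while more pages are still being fetched.
await foreach (var customer in channel.Reader.ReadAllAsync())
{
    await ProcessAsync(customer); // hypothetical downstream work
}

await producer;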

Finally, you could even distribute this across multiple machines. Use FeedRanges:

IReadOnlyList<FeedRange> feedRanges = await container.GetFeedRangesAsync();

// You can distribute each feedRange to separate compute via feedRange.ToJsonString() and FeedRange.FromJsonString(string),
// or start concurrent Tasks, one per FeedRange, if the machine is big enough.

// per FeedRange, you can do:
using var queryIterator = container.GetItemQueryIterator<Customer>(feedRange, new QueryDefinition("select * from c"));

var records = new Dictionary<string, Customer>();

while (queryIterator.HasMoreResults)
{
    var response = await queryIterator.ReadNextAsync().ConfigureAwait(false);
    foreach (var customer in response.Resource)
    {
        records.Add(customer.Id, customer);
    }
}
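
For the "concurrent Tasks, one per FeedRange" option mentioned in the comment above, a rough single-machine sketch could look like this (the ConcurrentDictionary and the explicit null continuation token are assumptions for the example, not requirements of the SDK):

using System.Collections.Concurrent;
using System.Linq;

IReadOnlyList<FeedRange> feedRanges = await container.GetFeedRangesAsync();

var records = new ConcurrentDictionary<string, Customer>();

// One concurrent query per FeedRange; each FeedRange covers a slice of the container.
await Task.WhenAll(feedRanges.Select(async feedRange =>
{
    using var iterator = container.GetItemQueryIterator<Customer>(
        feedRange,
        new QueryDefinition("select * from c"),
        continuationToken: null);

    while (iterator.HasMoreResults)
    {
        var response = await iterator.ReadNextAsync().ConfigureAwait(false);
        foreach (var customer in response.Resource)
        {
            // ConcurrentDictionary because multiple tasks add results in parallel.
            records.TryAdd(customer.Id, customer);
        }
    }
}));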

Upvotes: 1
