Reputation: 9173
I have a collection of 10 million+ records and need to load them all into memory as quickly as possible. Is there a way to do this faster than using the query iterator?
This is my current code. I have 40,000 RU/s allocated, but it doesn't seem to use all of the available RUs when I run it.
using var queryIterator = container.GetItemQueryIterator<Customer>("select * from c");
var records = new Dictionary<string,Customer>();
while (queryIterator.HasMoreResults)
{
var response = await queryIterator.ReadNextAsync().ConfigureAwait(false);
foreach (var customer in response.Resource)
{
records.Add(customer.Id, customer);
}
}
The above code works, but it takes a long time, and I see my RU usage sitting at under 10% of what's allocated.
The partition key of my documents is id, so every document is in its own logical partition. Would this be the reason for the slow read performance? Is there a way to speed it up?
I tried increasing the page size with QueryRequestOptions:
var options = new QueryRequestOptions { MaxItemCount = 10000 };
But it didn't make much of a difference. Even if I set it higher, it seems to max out at around 5,000-odd documents per page.
Should I be running multiple queries in parallel since I have the RU headroom?
Upvotes: 1
Views: 882
Reputation: 15603
There isn't an exact answer because it really depends on your application and resources.
You can start by fine-tuning the concurrency by setting MaxConcurrency to either a calculated value or -1 (which lets the SDK manage the degree of parallelism for you). Reference: https://learn.microsoft.com/azure/cosmos-db/nosql/performance-tips-query-sdk?tabs=v3&pivots=programming-language-csharp#tune-the-degree-of-parallelism
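For example, a minimal sketch (assuming the options are passed into the same query you already run):
var options = new QueryRequestOptions
{
    // -1 lets the SDK decide the degree of parallelism based on the number of partitions
    MaxConcurrency = -1
};
using var queryIterator = container.GetItemQueryIterator<Customer>("select * from c", requestOptions: options);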
You are already increasing the page size; you can also fine-tune MaxBufferedItemCount, which controls how many items the SDK pre-fetches ahead of your reads. Reference: https://learn.microsoft.com/en-us/azure/cosmos-db/nosql/performance-tips-query-sdk?tabs=v3&pivots=programming-language-csharp#tune-the-buffer-size
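Putting the knobs together, something like this (the numbers are illustrative assumptions to tune against your RU and memory budget, not recommendations):
var options = new QueryRequestOptions
{
    MaxItemCount = 10000,          // page size per round trip
    MaxConcurrency = -1,           // let the SDK parallelize across partitions
    MaxBufferedItemCount = 100000  // how many items the SDK may pre-fetch in the background
};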
Also keep in mind that it depends on what you do with the responses. In your case you are just putting them in a Dictionary, but if you put them in a construct that can start consuming them right away (like a producer/consumer channel or a pub/sub system), the process that consumes the results can work while you fetch more pages.
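A minimal sketch of that idea with System.Threading.Channels (the bounded capacity and the consumer body are illustrative assumptions):
// using System.Threading.Channels;
var channel = Channel.CreateBounded<Customer>(100000);

// Producer: fetch pages and hand items off as soon as they arrive
var producer = Task.Run(async () =>
{
    using var iterator = container.GetItemQueryIterator<Customer>("select * from c");
    while (iterator.HasMoreResults)
    {
        var response = await iterator.ReadNextAsync().ConfigureAwait(false);
        foreach (var customer in response.Resource)
        {
            await channel.Writer.WriteAsync(customer);
        }
    }
    channel.Writer.Complete();
});

// Consumer: starts working while more pages are still in flight
var consumer = Task.Run(async () =>
{
    await foreach (var customer in channel.Reader.ReadAllAsync())
    {
        // process customer here
    }
});

await Task.WhenAll(producer, consumer);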
Finally, you could even distribute this across multiple machines. Use FeedRanges:
IReadOnlyList<FeedRange> feedRanges = await container.GetFeedRangesAsync();
// You can distribute each feedRange to a separate compute node via
// feedRange.ToJsonString() and the static FeedRange.FromJsonString(string),
// or start concurrent Tasks, one per FeedRange, if the machine is big enough.
// Per FeedRange, you can do:
using var queryIterator = container.GetItemQueryIterator<Customer>(feedRange, new QueryDefinition("select * from c"), continuationToken: null);
var records = new Dictionary<string,Customer>();
while (queryIterator.HasMoreResults)
{
var response = await queryIterator.ReadNextAsync().ConfigureAwait(false);
foreach (var customer in response.Resource)
{
records.Add(customer.Id, customer);
}
}
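If you stay on one machine, here is a minimal sketch of the one-Task-per-FeedRange variant (using a ConcurrentDictionary so the concurrent tasks can write safely):
// using System.Collections.Concurrent; using System.Linq;
IReadOnlyList<FeedRange> feedRanges = await container.GetFeedRangesAsync();
var records = new ConcurrentDictionary<string, Customer>();

await Task.WhenAll(feedRanges.Select(async feedRange =>
{
    using var iterator = container.GetItemQueryIterator<Customer>(
        feedRange, new QueryDefinition("select * from c"), continuationToken: null);
    while (iterator.HasMoreResults)
    {
        var response = await iterator.ReadNextAsync().ConfigureAwait(false);
        foreach (var customer in response.Resource)
        {
            records.TryAdd(customer.Id, customer);
        }
    }
}));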
Upvotes: 1