Reputation: 44278
I'm doing some R&D to move a product catalog into CosmosDB.
In its simplest terms, a Product document will have:
A manufacturer will log into this system and will only be able to query their own data, so there will always be a ManufacturerId = SINGLE_VALUE filter on every query.
When reviewing the Cosmos DB docs on choosing the correct partition strategy, there seem to be 2 main points:
- Choose a partition key with a high cardinality.
- Choose a partition key that gives an even distribution of data.
In my scenario above, choosing the product Id as the PartitionKey would be pretty extreme: 1 document per logical partition. On the other hand, choosing Manufacturer wouldn't be great either, since that won't result in an even distribution (some manufacturers have 10 products, others have 100,000).
One way to ensure an even distribution would be to take the first 4 characters of the GUID and use that as the PartitionKey (so at most 16^4 = 65,536 partition key values). Based on the existing dataset I have, this does result in an even distribution of data, but I'm wondering whether there are any downsides to doing this.
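As a rough way to check the prefix idea against a dataset, the sketch below buckets GUIDs by their first n hex characters and reports how many buckets are used and how uneven they are. A prefix of n hex characters can take at most 16^n values (4,096 for three characters, 65,536 for four). The helper names `prefix_key` and `bucket_stats` and the random sample IDs are assumptions made for illustration, not part of any Cosmos DB API.

```python
import random
import uuid
from collections import Counter


def prefix_key(product_id: str, length: int = 4) -> str:
    """Return the first `length` hex characters of a GUID string (hypothetical helper)."""
    return product_id.replace("-", "")[:length].lower()


def bucket_stats(product_ids, length: int = 4) -> dict:
    """Summarize how documents would spread across prefix-based partition keys."""
    counts = Counter(prefix_key(pid, length) for pid in product_ids)
    return {
        "possible_keys": 16 ** length,
        "used_keys": len(counts),
        "largest_bucket": max(counts.values()),
        "smallest_bucket": min(counts.values()),
    }


# Seeded random GUIDs stand in for real product ids (assumption).
rng = random.Random(42)
sample_ids = [
    str(uuid.UUID(int=rng.getrandbits(128), version=4)) for _ in range(20_000)
]

stats = bucket_stats(sample_ids, length=4)
print(stats)
```

Running the same function over the real product ids, with different prefix lengths, would show how the bucket count and skew change.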
Are there any downsides to just using the entire productId as the PartitionKey (1 doc per partition)? The docs seem to indicate that's a valid approach for a system that stores user profiles. Would this approach have implications for searching for multiple products in the same search?
Upvotes: 12
Views: 3552
Reputation: 2942
Depending on the size of your docs and the overall number of docs for a manufacturer, I would probably go with ManufacturerID as your PartitionKey.
Would it be unbalanced? Yes. But as long as the biggest manufacturer can stay under the partition limit (12.5GB as of this writing), you would have very efficient querying. If you chose the GUID field, you would always have to use a cross-partition query, which needs more RUs and is therefore more costly and slower. The assumption I'm making here is that the larger manufacturers will probably execute more queries.
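To make the trade-off concrete, here is a toy model (plain Python, not the Cosmos DB SDK) that groups documents into logical partitions under each key choice and counts how many partitions hold a given manufacturer's products. The field names `manufacturerId` and `id` and the sample data are assumptions; real query cost also depends on how logical partitions map onto physical ones.

```python
from collections import defaultdict


def build_partitions(docs, key_field):
    """Group documents into logical partitions by the chosen key field."""
    partitions = defaultdict(list)
    for doc in docs:
        partitions[doc[key_field]].append(doc)
    return partitions


def partitions_holding(partitions, manufacturer_id):
    """Count logical partitions containing at least one matching document."""
    return sum(
        1
        for docs in partitions.values()
        if any(d["manufacturerId"] == manufacturer_id for d in docs)
    )


# Ten hypothetical products: eight for manufacturer m1, two for m2.
docs = [
    {"id": f"p{i}", "manufacturerId": "m1" if i < 8 else "m2"}
    for i in range(10)
]

by_manufacturer = build_partitions(docs, "manufacturerId")
by_product = build_partitions(docs, "id")

print(partitions_holding(by_manufacturer, "m1"))  # 1
print(partitions_holding(by_product, "m1"))       # 8
```

With the manufacturer as the key, the always-present ManufacturerId filter pins the query to one logical partition; with the product id as the key, the same filter gives the engine nothing to route on.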
If you do think you'll bump up against that partition limit, another idea would be to partition on a sub-category for each manufacturer, if that's possible. Example: Manufacturer = General Motors, Category = SUVs, and then partition on a custom string field that represents Manufacturer_Category. This composite partition key is the best compromise between read/write speed and partition balancing.
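A minimal sketch of building such a Manufacturer_Category value before writing the document might look like this. The function name, the normalization rules, and the `partitionKey` property name are all assumptions chosen for illustration.

```python
def manufacturer_category_key(manufacturer: str, category: str) -> str:
    """Build a synthetic partition key of the form manufacturer_category.

    Lower-casing and hyphenating are illustrative choices so that the same
    logical pair always produces the same key string.
    """
    def normalize(value: str) -> str:
        return "-".join(value.strip().lower().split())

    return f"{normalize(manufacturer)}_{normalize(category)}"


# Hypothetical product document; the field names are assumptions.
product = {"id": "12345", "manufacturer": "General Motors", "category": "SUVs"}
product["partitionKey"] = manufacturer_category_key(
    product["manufacturer"], product["category"]
)

print(product["partitionKey"])  # general-motors_suvs
```

One consequence of this design is that a query filtering only on the manufacturer spans several partition key values, so it helps most when queries usually know the category as well.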
FYI: There is no need to use a substring of a GUID as the partition key, because CosmosDB automatically hashes your values into the appropriate partition key ranges for the number of physical partitions you have.
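The idea of hashing a key value to pick a physical partition can be sketched as follows. This is only a stand-in: it uses SHA-256 and a modulo, whereas Cosmos DB uses its own hash function and assigns hash ranges to physical partitions. The function name and the key values are assumptions.

```python
import hashlib
from collections import Counter


def physical_partition(partition_key: str, partition_count: int) -> int:
    """Map a partition key value to a physical partition index.

    SHA-256 plus modulo is a stand-in for the service's own
    hash-range assignment (assumption).
    """
    digest = hashlib.sha256(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % partition_count


# 10,000 distinct key values spread over 4 hypothetical physical partitions.
keys = [f"product-{i}" for i in range(10_000)]
spread = Counter(physical_partition(k, 4) for k in keys)

print(sorted(spread.items()))
```

Because the full key is hashed anyway, many distinct key values tend to spread evenly without any manual prefixing.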
Upvotes: 2
Reputation: 369
Using a key that is unique per-document is a good way to ensure even distribution to support high performance - so that makes the full product id a great choice. I don't believe you would gain any advantage from using a substring of a full guid as a partition key - and you would be limiting your maximum number of usable partitions.
So why not always use a unique identifier as the partition key?
First, if you add a partition key to a query, you do not need to enable cross-partition query and you will have a lower overall query cost (RU/s). So if you can design your partition key to reduce your need for cross-partition queries it could save RU/s. I don't think a 'substring of a guid' helps you there, because the random nature of the guid would not distribute documents in a way you could take advantage of for efficient querying.
Second, only documents with the same partition key are guaranteed to all be available on the same partition if you need to involve them in a transactional stored procedure. A 'substring of a guid' also doesn't help with this case.
I almost always use 'identifier' based partition keys such as your product id. This doesn't always correspond to the 'id' of the document itself. Sometimes I have multiple documents with content related to the same thing. For example, if I have some product information synced from another system, that sync job can be most efficient if it uses upsert - but due to current lack of partial update support in CosmosDB (see user voice) the whole document needs to be upserted. So in this case I have one document for the synced information, and a separate document for other information. This could look something like:
{
"id": "12345:myinfo",
"productid":"12345",
"info":{},
"type":"myinfotype"
},
{
"id": "12345:vendorsync",
"productid":"12345",
"syncedinfo":{},
"type":"vendorsync"
}
Here the product id is the partition key, and I have a couple of different documents related to that product that I know will reside on the same partition so I can query them efficiently or involve them in a transaction.
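As a small illustration of that layout, the sketch below builds related documents with ids of the form productid:suffix and groups them by the partition key value. The helper `related_doc` is hypothetical; the field names follow the example above.

```python
from collections import defaultdict


def related_doc(product_id: str, suffix: str, doc_type: str, **fields) -> dict:
    """Build a document whose id is unique but whose partition key is the product id."""
    return {
        "id": f"{product_id}:{suffix}",
        "productid": product_id,
        "type": doc_type,
        **fields,
    }


docs = [
    related_doc("12345", "myinfo", "myinfotype", info={}),
    related_doc("12345", "vendorsync", "vendorsync", syncedinfo={}),
    related_doc("67890", "myinfo", "myinfotype", info={}),
]

# Group document ids by partition key value ("productid").
by_partition = defaultdict(list)
for doc in docs:
    by_partition[doc["productid"]].append(doc["id"])

print(dict(by_partition))
```

Both documents for product 12345 end up under the same key value, which is what allows them to be queried together or used in one transaction.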
I have also used this pattern when implementing a revision system, so that all revisions of the same logical document are guaranteed to be placed on the same partition. In that case the document has a "documentid" that is the same for all revisions, and the actual "id" of the document is the document id with the revision number added.
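A sketch of that revision scheme, under assumptions: the ":" separator and the `revision` and `body` fields are illustrative choices, since the answer only says the id is the document id with the revision number added.

```python
def revision_doc(document_id: str, revision: int, body: dict) -> dict:
    """Build one revision; "documentid" is the partition key shared by all revisions."""
    return {
        "id": f"{document_id}:{revision}",
        "documentid": document_id,
        "revision": revision,
        "body": body,
    }


def latest_revision(docs, document_id: str):
    """Return the highest-numbered revision of a logical document, or None."""
    candidates = [d for d in docs if d["documentid"] == document_id]
    return max(candidates, key=lambda d: d["revision"], default=None)


revisions = [
    revision_doc("doc-1", 1, {"title": "draft"}),
    revision_doc("doc-1", 2, {"title": "final"}),
    revision_doc("doc-2", 1, {"title": "other"}),
]

print(latest_revision(revisions, "doc-1")["id"])  # doc-1:2
```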
Please also review 'Design for Partitioning' here if you haven't already: https://learn.microsoft.com/en-us/azure/cosmos-db/partition-data
Upvotes: 8