Reputation: 3427
I'm setting up our first Azure Cosmos DB - I will be importing into the first collection, the data from a table in one of our SQL Server databases. In setting up the collection, I'm having trouble understanding the meaning and the requirements around the partition key, which I specifically have to name while setting up this initial collection.
I've read the documentation here: (https://learn.microsoft.com/en-us/azure/cosmos-db/documentdb-partition-data) and still am unsure how to proceed with the naming convention of this partition key.
Can someone help me understand how I should be thinking in naming this partition key? See the screenshot below for the field I'm trying to fill in.
In case it helps, the table I'm importing consists of 7 columns, including a unique primary key, a column of unstructured text, a column of URL's and several other secondary identifiers for that record's URL. Not sure if any of that information has any bearing on how I should name my Partition Key.
EDIT: I've added a screenshot of several records from the table from which I'm importing, per request from @Porschiey.
Upvotes: 102
Views: 123009
Reputation: 2241
Honestly the video here* was a MAJOR help to understanding partitioning in CosmosDb.
But, in a nutshell: The PartitionKey is a property that will exist on every single object that is best used to group similar/related objects together.
Good examples include Location (like City), Customer Id, Team, and more. Naturally, it wildly depends on your solution; so perhaps if you were to post what your object looks like we could recommend a good partition key.
EDIT: Should be noted that PartitionKey isn't required for collections under 10GB. (thanks David Makogon)
* The video used to live on this MS docs page entitled, "Partitioning and horizontal scaling in Azure Cosmos DB", but has since been removed. A direct link has been provided, above.
Upvotes: 82
Reputation: 13537
To explain what the partition Key is, we should first be sure we understand what Cosmos is.
Cosmos DB as a service is a managed storage solution where we don't need to worry about how many servers are responsible for storing the data we provide. We don’t even see it listed that if we choose 100 RU/s capacity, we might be running on just two servers, while 10,000 RU/s means our DB is now scattered across 40 instances.
That’s all managed for us, but to make it work well, we are asked to provide some information about our data to help Cosmos ensure that just one node isn’t too ‘Hot’.
Enter : The whole purpose of partitioning keys.
Partition Keys help Cosmos DB distribute your records evenly across the multiple hosts that are going to be storing the data.
You are supposed to pick a field that evenly splits your data into more-or-less equal pie-slices. And your queries get much better when all the data is in just one or two partitions. They cost less and run faster.
So for you, worry about which field matters most.
Will you have different CaptureSessionIDs that are shared with 100 or more records? And that matters to your app? Sounds like a good candidate.
Another good choice might be if your data is more or less split into 5 or 10 or 100 even slices based on the Category or ProductId or SourceCountry.
This is our task as devs, to think about splitting our data into meaningful categories which can be used to then store the data across multiple hosts.
Upvotes: 0
Reputation: 286
There is documentation available from Microsoft about partition keys. According to me you need to check the queries or operations that you plan to perform with cosmos DB. Are they read-heavy or write-heavy? if read heavy it is ideal to choose a partition key in the where clause that will be used in the query, if it is a write heavy operation then look for a key which has high cardinality
Always point reads /writes are better since it consumes way less RU's than running other queries
Upvotes: 0
Reputation: 139
You are defining a logical partition. Underneath, physically the data is split into physical partitions by Azure.
Ideally a partitionKey should be a primary Key, or a field with high cardinality to ensure proper distribution, with the self generated id field within that partition also set to the primary key, that will help with documentFetchById much faster.
You cannot change a partitionKey once container is created.
Looking at the dataset, captureId is a good candidate for partitionKey, with id set manually to this field, and not an auto generated cosmos one.
Upvotes: 0
Reputation: 6260
I've put together a detailed article here Azure Cosmos DB. Partitioning.
Cosmos DB designed to scale horizontally based on the distribution of data between Physical Partitions (PP) (think of it as separately deployable underlaying self-sufficient node) and logical partition - bucket of documents with same characteristic (partition key) which is supposed to be stored fully on the same PP. So LP can't have part of the data on PP1 and another on PP2.
There are two main limitation on Physical Partitions:
Logical partition has one - 20GB limit in size.
NOTE: Since initial releases of Cosmos DB size limits grown and I won't be surprised that soon size limitations might increase.
Based on the Microsoft recommendation for maintainable data growth you should select partition key with highest cardinality (like Id
of the document or a composite field). For the main reason:
Spread request unit (RU) consumption and data storage evenly across all logical partitions. This ensures even RU consumption and storage distribution across your physical partitions.
It is critical to analyze application data consumption pattern when considering right partition key. In a very rare scenarios larger partitions might work though in the same time such solutions should implement data archiving to maintain DB size from a get-go (see example below explaining why). Otherwise you should be ready to increasing operational costs just to maintain same DB performance and potential PP data skew, unexpected "splits" and "hot" partitions.
Having very granular and small partitioning strategy will lead to an RU overhead (definitely not multiplication of RUs but rather couple additional RUs per request) in consumption of data distributed between number of physical partitions (PPs) but it will be neglectable comparing to issues occurring when data starts growing beyond 50-, 100-, 150GB.
Main reason is that Cosmos DB is designed to scale horizontally and provisioned throughput per PP is limited to the [total provisioned per container (or DB)] / [number of PP]
.
Once PP split occurs due to exceeding 50GB size your max throughput for existing PPs as well as two newly created PPs will be lower then it was before split.
So imagine following scenario (consider days as a measure of time between actions):
CustomerId
partition key (which will generate one underlying PP1). Maximum throughput per PP is 10k/1 = 10k RUs
10k/2 = 5k RUs
10k/3 = 3.333k RUs
IMPORTANT: As a result on
[Day 2]
C1
data was queried with up to 10k RUs but on[Day 4]
with only max to 3.333k RUs which directly impacts execution time of your query
This is a main thing to remember when designing partition keys in current version of Cosmos DB (12.03.21).
Upvotes: 15
Reputation: 357
Partition key acts as a logical partition.
Now, what is a logical partition, you may ask? A logical partition may vary upon your requirements; suppose you have data that can be categorized on the basis of your customers, for this customer "Id" will act as a logical partition and info for the users will be placed according to their customer Id.
What effect does this have on the query?
While querying you would put your partition key as feed options and won't include it in your filter.
e.g: If your query was
SELECT * FROM T WHERE T.CustomerId= 'CustomerId';
It will be Now
var options = new FeedOptions{ PartitionKey = new PartitionKey(CustomerId)};
var query = _client.CreateDocumentQuery(CollectionUri,$"SELECT * FROM T",options).AsDocumentQuery();
Upvotes: 14
Reputation: 2501
Each partition on a table can store up to 10GB (and a single table can store as many document schema types as you like). You have to choose your partition key though such that all the documents that get stored against that key (so fall into that partition) are under that 10GB limit.
I'm thinking about this too right now - so should the partition key be a date range of some type? In that case, it would really depend on how much data is getting stored in a period of time.
Upvotes: 3
Reputation: 463
CosmosDB can be used to store any limit of data. How it does in the back end is using partition key. Is it the same as Primary key? - NO
Primary Key: Uniquely identifies the data Partition key helps in sharding of data(For example one partition for city New York when city is a partition key).
Partitions have a limit of 10GB and the better we spread the data across partitions, the more we can use it. Though it will eventually need more connections to get data from all partitions. Example: Getting data from same partition in a query will be always faster then getting data from multiple partitions.
Upvotes: 9
Reputation: 18387
Partition Key is used for sharding, it acts as a logical partition for your data, and provides Cosmos DB with a natural boundary for distributing data across partitions.
You can read more about it here: https://learn.microsoft.com/en-us/azure/cosmos-db/partition-data
Upvotes: 4