Evert Wiesenekker
Evert Wiesenekker

Reputation: 23

Confusing partitioning key of CosmosDB

I am working my way through the Python examples of CosmosDB (see CosmosDB for Python) and I see a container definition as follows:

    partition_key = PartitionKey(path='/id', kind='Hash')
    db.create_container(id=id, partition_key=partition_key)

Code for reading an item:

response = container.read_item(item=doc_id, partition_key=doc_id)

Now my confusion is why is a partition key chosen which is the same as a unique document id. So, what is the use of partitioning here?

In my opinion, partition is something which applies over keys sharing some common group, for example partition over food groups.

Upvotes: 0

Views: 1825

Answers (2)

Aboo
Aboo

Reputation: 2454

In my opinion it's wrong that the documentation of CosmosDB has used id as an example for partition key. Most probably it's because they don't have anything else to use as partition key.

When you partition your CosmosDB database you need to consider a few factors.

First, you need to know the domain and traffic models. Without the domain knowledge it's almost impossible to come up with an optimised partition key. That (I think) is exactly why the documentations have used id as the partition key because it's the maximum possible number of logical partitions in an unidentified domain.

This is 100% a wrong decision unless you want id to be your partition key because of some requirements in your domain that suggests that.

Do not use id as partition key!

Keep in mind that you are in charge of logical partitions by choosing a good partition key. However, you are NOT in charge of physical partitions which makes the scaling possible. Azure will randomly put x number of logical partitions together in a physical partition. So potentially "hot" logical partitions can sit in one physical partition and make a "hot" shard.

Defining the best partition key is almost impossible without the business domain knowledge.

Choosing the maximum possible number of partitions (for example id) is not the most optimised decision and will result in bad performance and scaling. Remember that there are certain activities that are only possible within a single partition and you cannot do them between multiple partitions or containers.

For example:

  • If you want to make an atomic operation between a few documents, that is only possible if those documents are in a single partition
  • Unique fields are only unique in context of one partition
  • Maximum number of partitions will make most of your queries multi-partition-queries which are generally more expensive.

Choose a partition key that at any given time distributes your traffic as well as storage in the most balanced way

After you know the business then you need to choose a partition key that at any given time distributes your traffic as well as storage in the most balanced way.

Be prepared to repartition your data.

Once you developed your software and it goes live you may learn new things about the load and users which can potentially result in choosing a more optimised partition key. Be prepared to re-partition your data if that happens.

Define a property on the document just for the purpose of partitioning even if it's an exact copy of the value of another property. This will give you flexibility to repartition your data based on any value you put for that field.

In our case, we have a property called section on every document no matter where it is and what type it is. Then based on the logic we have around document types that are in each container we have different ways of generating that value.

Last tip is that, having single type containers will make the partitioning easier. For example it's easier to partition a container that includes only users. But if you have users and orders together in one container then you want the orders that belong to a user to sit on the same partition as the user is. This will get more complicated if you have more data types in a single container.

Upvotes: 2

Anupam Chand
Anupam Chand

Reputation: 2662

In my opinion, partition is something which applies over keys sharing some common group, for example partition over food groups.

This is not entirely true. If you look at the documentation, it says that you should choose a partition key that has a high cardinality. In other words, the property should have a wide range of possible values. It should be a value that will not change. You also need to note that if you want to update or delete a document, you will need to pass the partition key.

What happens in the background, is Cosmos can have multiple servers from 1 to infinity. It uses your partition key to logically partition your data. But it is still on one server. If your throughput goes beyond 10K RU or if your storage goes beyond 50GB, Cosmos will automatically split into 2 physical servers. This means your data is split into the 2 servers. The splitting can go on until the max throughput per server is < 10K RU and storage per server is < 50GB. This is how Cosmos can manage infinite scale. You may ask how would you predict which partition a document may go into. The answer is you can't. Cosmos produces a hash using your partition key with a value between 1 and the number of servers.

So the doc id is a good partition key because it is unique and can have a large range of values.

Just be aware that once Cosmos partitions to multiple servers, there is no automatic way currently to bring the number of servers down even if you reduce the storage or reduce the throughout.

Upvotes: 5

Related Questions