user3596100

Reputation: 187

Azure Data Explorer partitioning strategy

I have a table in Azure Data Explorer that collects data from IoT sensors. In the near future it will collect millions of records each day. So to get the best query performance I am looking into setting a partitioning policy: https://learn.microsoft.com/en-us/azure/data-explorer/kusto/management/partitioningpolicy

My table has 5 important columns: TenantId, DeviceId, VariableId, Value, Timestamp

The combination of (TenantId, DeviceId, VariableId) makes a sensor globally unique, and almost all queries will contain a filter like TenantId == 'xx' and DeviceId == 'xx' and VariableId == 'xx'. All of these columns are of type string and have high cardinality (10,000+ TenantIds, 1,000+ DeviceIds, 10,000+ VariableIds).
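For example, a typical query looks something like this (the table name and filter values are illustrative):

    SensorData
    | where TenantId == 'tenant42' and DeviceId == 'device7' and VariableId == 'humidity'
    | where Timestamp > ago(1h)
    | project Timestamp, Value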

Two questions:

  1. Would it be wise to apply partitioning on this table based on one or more of the string columns? It complies with the advice in the documentation that says:

    • The majority of queries use equality filters (==, in()).
    • The majority of queries aggregate/join on a specific string-typed column of large-dimension (cardinality of 10M or higher) such as an application_ID, a tenant_ID, or a user_ID.

But later on the page, for MaxPartitionCount, they say that it should be no higher than 1024 and lower than the cardinality of the column. Since I have high-cardinality columns, this does not seem to comply, so I am a bit confused.

  2. Would it be best to concatenate the string columns before ingestion and partition on the new column? Or only on TenantId, for example?

Upvotes: 1

Views: 920

Answers (1)

Yoni L.

Reputation: 25955

almost all queries will contain a filter like TenantId == 'xx' and DeviceId == 'xx' and VariableId == 'xx'.

Given this (and assuming you don't frequently join on any of these 3 columns), you could extend your data set with a new column that is the concatenation of the 3 (e.g., strcat_delim("_", TenantId, DeviceId, VariableId)).

You can do this either before ingestion into Kusto (better), or by using an update policy at ingestion time.
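A minimal sketch of the update-policy variant, assuming hypothetical table and function names (SensorDataRaw, SensorData, AddSensorKey) and the schema from the question; each command is run separately:

    // Landing table that raw ingestion targets
    .create table SensorDataRaw (TenantId: string, DeviceId: string, VariableId: string, Value: real, Timestamp: datetime)

    // Target table with the extra concatenated key column
    .create table SensorData (TenantId: string, DeviceId: string, VariableId: string, Value: real, Timestamp: datetime, SensorKey: string)

    // Function that derives the concatenated column
    .create function AddSensorKey() {
        SensorDataRaw
        | extend SensorKey = strcat_delim("_", TenantId, DeviceId, VariableId)
    }

    // Update policy: rows ingested into SensorDataRaw are transformed into SensorData
    .alter table SensorData policy update
    @'[{"IsEnabled": true, "Source": "SensorDataRaw", "Query": "AddSensorKey()", "IsTransactional": true}]'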

Then, set that new column as the hash partition key in the table's data partitioning policy.
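For example, reusing the hypothetical SensorKey column from the sketch above and the documented starting value of 256 for MaxPartitionCount:

    .alter table SensorData policy partitioning ```
    {
      "PartitionKeys": [
        {
          "ColumnName": "SensorKey",
          "Kind": "Hash",
          "Properties": {
            "Function": "XxHash64",
            "MaxPartitionCount": 256,
            "Seed": 1,
            "PartitionAssignmentMode": "Uniform"
          }
        }
      ]
    }```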


for MaxPartitionCount, they say that it should be no higher than 1024 and lower than the cardinality of the column. Since I have high-cardinality columns, this does not seem to comply, so I am a bit confused.

Let's assume you have a cluster with 20 nodes, a column C with cardinality 10,000,000, and you want to set it as the table's hash partition key.

Following the guidelines in the documentation regarding MaxPartitionCount:

  • Supported values are in the range (1,1024]. -> MaxPartitionCount should be larger than 1 and less than or equal to 1024.
  • The value is expected to be larger than the number of nodes in the cluster. -> MaxPartitionCount should be larger than 20.
  • The value is expected to be smaller than the cardinality of the column. -> MaxPartitionCount should be smaller than 10,000,000.
  • We recommend that you start with a value of 256. Adjust the value as needed, based on the above considerations, or based on the benefit in query performance vs. the overhead of partitioning the data post-ingestion.

As I don't see any conflicting information here (256 > 1, 256 <= 1024, 256 > 20, 256 < 10M), you may want to clarify where the confusion is coming from.
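Once partitioning takes effect, queries that apply an equality filter on the hash partition key can skip most extents. A sketch, using the hypothetical names from the examples above:

    SensorData
    | where SensorKey == strcat_delim("_", "tenant42", "device7", "humidity")
    | where Timestamp > ago(1d)
    | summarize avg(Value) by bin(Timestamp, 1h)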

Upvotes: 2
