Reputation: 187
I have a table in Azure Data Explorer that collects data from IoT sensors. In the near future it will collect millions of records each day. So to get the best query performance I am looking into setting a partitioning policy: https://learn.microsoft.com/en-us/azure/data-explorer/kusto/management/partitioningpolicy
My table has 5 important columns: TenantId, DeviceId, SensorId, Value, Timestamp
The combination of (TenantId, DeviceId, VariableId) makes a sensor globally unique, and almost all queries will contain a part that says TenantId = 'xx' and DeviceId = 'xx' and VariableId = 'xx'. All these columns are of type string, and have a high-cardinality (10.000+ Tenants, 1000+ DeviceIds, 10.000+ VariableIds)
Two questions:
Would it be wise to apply partitioning on this table based on one or more of the string columns? It complies with the advice in the documentation that says:
But later on the page, for the MaxPartitionCount they say that it should be not higher than 1024 and lower than the cardinality of the column. As I have high-cardinality columns this does not comply, so I am a bit confused.
Upvotes: 1
Views: 920
Reputation: 25955
almost all queries will contain a part that says TenantId = 'xx' and DeviceId = 'xx' and VariableId = 'xx'.
Given this (and assuming you don't frequently join
on any of these 3 columns), you could extend your data set with a new column, which is the concatenation of these 3 (e.g. strcat_delim("_", TenantId, DevideId, VariableId
).
You can do this either before ingestion into Kusto (better), or - using an update policy
at ingestion time.
Then, set that new column as the hash partition key
in the table's data partitioning policy
.
for the MaxPartitionCount they say that it should be not higher than 1024 and lower than the cardinality of the column. As I have high-cardinality columns this does not comply, so I am a bit confused.
Let's assume you have a cluster with 20
nodes, a column C
with cardinality 10,000,000
, and you want to set it as the table's hash partition key
.
Following the guidelines in the documentation regarding MaxPartitionCount
:
(1,1024]
. -> MaxPartitionCount
should be larger than 1
and lower than or equal to 1024
.MaxPartitionCount
should be larger than 20
.MaxPartitionCount
should be lower than 10,000,000
.256
.
Adjust the value as needed, based on the above considerations, or based on the benefit in query performance vs. the overhead of partitioning the data post-ingestion.As I don't see any conflicting information here (256
> 1
, 256
<= 1024
, 256
> 20
, 256
< 10M
) - you may want to clarify where the confusion is coming from.
Upvotes: 2