Reputation: 1932
It's very well known that micro-partitions within snowflake architecture power the optimizer query approach to render data results faster when users needs to know data per column that points out:
However, checking the documentation the micro-partitions also keep information about the number of distinct values. I have been trying to test when the query optimizer can come in the hood to avoid deploys a computing layer and then render data quick and without need of computing tasks.
I tried the MAX, MIN, COUNT and for these the results , the execution was render without any computing layer and in a very decent time. However I tried to execute a COUNT DISTINCT but here I notice the computing layer was deployed before show up results:
So, the way how the micro-partitions benefit the query optimizer is keep summarize data available but still when the query demands distinct counts or AVG it's requiered a computing operation?
thanks.
Upvotes: 0
Views: 455
Reputation: 6279
I'm not from Snowflake but I assume the reason is because it is much more complex (requires more space + compute) to gather statistics for distinct counts and averages as data is being ingested into the tables than it does for MIN, MAX and COUNT.
If you want to store MIN, MAX and COUNT for each micro-partition in the metadata layer then you're only storing a maximum of 3 values for each micro-partition per column. If you wanted to store the distinct count it's completely dependant on the data within the micro-partition itself (can be many many distinct values with a micro-partition each with a count of 1 for example).
I don't think storing averages or distinct counts per micro-partition makes a lot of sense since when you do your average/distinct calculations you'll almost always be calculating them across micro-partitions anyway which means you'd have to recalculate the value all the time. Also how would you average a string?
Also storing a global value for MIN/MAX/COUNT and updating these values as new data comes into the table is probably very quick and has a very predictable compute and metadata storage footprint but calculating global values for AVG and COUNT DISTINCT do not.
Upvotes: 2