Reputation: 602

What is the difference between Micro-partitions and Data Clustering

I was reading this link. and trying to understand what is the relation between both, it is quite confusing. Please explain

Upvotes: 0

Answers (2)

Simeon Pilgrim

Reputation: 25938

Much like Gokhan's answer, but I would describe it differently.

Micro-partitions:

Every time to write data to snowflake it's written to a new file, because the files are immutable. This means you have many fragments. But due to keep metadata for tables, when you query, Snowflake can prune tables known to not contain the data being looked for. Otherwise it load all data (for the columns you select) and does brute force "full table scans". Snowflake micro-partitions have no relation to classic partitioning, except for if you are lucky you might get pruning. Also the micro partitions are written how you load the data, and are just "split" into more partitions as your writes go over a threshold. Much like you get from WinZip/7Zip/Gzip with a max file chunk size parameter.. The next thing to note is if you update a ROW in a partition, the whole partition is rewritten. AND the order of the updated rows is not controllable and can be random (based of your table join logic). Thus if you do many writes, or many "small" updates, you will have very bad fragmentation of your partitions, which very negatively impact compile time, as all the meta data needs to be loaded. Which they are now charging for. This is like this because S3 is immutable file store, but this is also why you can separate compute form data. It's also how "timetravel" and "history days data retention" work, because the prior state of the table is not deleted for this time, thus why you pay for the S3 storage. Which also means watchout for your churn as you pay for all data written cumulatively for days

Data-Clustering:

Is a way to indicate how you would like the data to be ordered by. And ether the legacy manual cluster commands or the auto-clustering will rewrite the partitions to improve the clustering. Think of Norton SpeedDisk (if you are old school) Writing to table in the order you want it clustered in (aka always have an ORDER BY on your INSERT), will improve things. But you can only have a table clustered on one set of "KEYS", thus you need to think about how your mostly use the data before clustering it. Or have multiple copies of the data with the min-sub-set/sort behavior we need (we do this). warning: UPDATEs currently do not respect this clustering, and you can pay upwards of 4x the cost of a full table rewrite by having auto clustering running, you need to watch this as it's potentially unbounded cost.

So in short clustering is like poor man's indexes, and Snowflake is basically a massive full table scan/map/reduce processing. But it's really good at that, and when you understand how it works, it's super fun to use.

Upvotes: 5

Gokhan Atil

Reputation: 10069

Snowflake stored data in micro-partitions. Each micro-partition contains between 50 MB and 500 MB of uncompressed data. If you are familiar with partition on traditional databases, micro-partitions are very similar to them but micro-partitions are automatically generated by Snowflake. You do not need to partition table as you need to do in traditional database systems.

Data Clustering is to distribute the data based on a clustering key into these micro-partitions. If clustering is not enabled for your table, your table will still have micro-partitions but the data will not be distributed based on a specific key.

Let's assume we have A, B, C, D unique values of X column in our table (t), and we have 5 partitions:

P1: AABBC
P2: ABDAC
P3: BBBCA
P4: CBDCC
P5: BBCCD

If we try to run "SELECT * FROM t WHERE X=A" query, Snowflake needs to read P1, P2, P3 partitions. If this table is clustered based on X column, the data will be distributed liked this (in theory):

P1: AAAAA
P2: BBBBB
P3: BBBBC
P4: CCCCC
P5: CCDDD

In this case, when we run "SELECT * FROM t WHERE X=A" query, Snowflake will need to read only P1 partition.

Micro-partitions (or partitioning) is very important when accessing a portion of data in a large table, because Snowflake can prune partitions based on your filter conditions in your query. If a right key (column) is defined for clustering, the partition pruning would be much more effective.

Upvotes: 1

What is the difference between Micro-partitions and Data Clustering

Answers (2)

Related Questions