I have a big table.
CREATE OR REPLACE TABLE BIG_TABLE (
    EVENT_DATETIME TIMESTAMP_LTZ(9), -- Higher granularity and cardinality than anything else
    SOURCE_ID VARCHAR(16777216),     -- Two possibilities: SOURCE 1 or SOURCE 2
    COL_TYPE VARCHAR(16777216),      -- Several hundred possibilities; 80% of rows belong to 20% of types
    COL_JSON VARIANT,
    ROW_CREATED_AT TIMESTAMP_LTZ(9)
);
I attempted to improve the natural clustering by reordering the table:
INSERT INTO ORDERED_BIG_TABLE
SELECT * FROM BIG_TABLE ORDER BY EVENT_DATETIME::DATE, COL_TYPE, COL_JSON:EVENT_ID;

ALTER TABLE BIG_TABLE SWAP WITH ORDERED_BIG_TABLE;
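For completeness, the reorder can also be done as a single CTAS, so ORDERED_BIG_TABLE doesn't need to exist beforehand (a sketch using the same names as above):

CREATE OR REPLACE TABLE ORDERED_BIG_TABLE AS
SELECT *
FROM BIG_TABLE
ORDER BY EVENT_DATETIME::DATE, COL_TYPE, COL_JSON:EVENT_ID;

ALTER TABLE BIG_TABLE SWAP WITH ORDERED_BIG_TABLE;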
However, when I look at the average clustering depth for EVENT_DATETIME::DATE, it is about 30 on a table of roughly 15,000 micro-partitions.
Here is the output of SYSTEM$CLUSTERING_INFORMATION('BIG_TABLE', '(EVENT_DATETIME::DATE, COL_TYPE)'):
{
"cluster_by_keys" : "LINEAR(EVENT_DATETIME::DATE, COL_TYPE)",
"total_partition_count" : 15105,
"total_constant_partition_count" : 10179,
"average_overlaps" : 34.70654,
"average_depth" : 34.27224,
"partition_depth_histogram" : {
"00000" : 0,
"00001" : 10169,
"00002" : 0,
"00003" : 1,
"00004" : 4179,
"00005" : 32,
"00006" : 0,
"00007" : 0,
"00008" : 0,
"00009" : 0,
"00010" : 0,
"00011" : 0,
"00012" : 0,
"00013" : 0,
"00014" : 0,
"00015" : 0,
"00016" : 0,
"08192" : 722
}
}
You'll see this type of output when the candidate clustering key you pass to SYSTEM$CLUSTERING_INFORMATION has a higher cardinality than the number of micro-partitions in the table.
When you specify a multi-part clustering key, the cardinalities of the parts multiply together. In this case, the number of distinct values of EVENT_DATETIME::DATE gets multiplied by the number of distinct values of COL_TYPE. From your information we know there are "several hundred" of the latter, some with many rows and some with few.
Assuming the table holds one year's worth of data, that's a cardinality of 365. Multiplying this by 300 for "several hundred" gives us 109,500. This cardinality is about 7.2 times greater than the 15,105 micro-partitions in the table.
So the SYSTEM$CLUSTERING_INFORMATION output is showing the answer to this question:
What would happen if we packed 109,500 things into 15,105 containers?
The answer is, of course, that some of the containers will have to hold more than one kind of thing.
This is also why the total_constant_partition_count is comparatively low. Dividing this number by total_partition_count shows that roughly two thirds of the micro-partitions are constant; the remaining third have to span multiple key values in order to have enough data to fill a micro-partition.
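The arithmetic above can be checked in a few lines of Python. The 365 days and 300 types are the assumed cardinalities from the estimate; the partition counts come straight from the SYSTEM$CLUSTERING_INFORMATION output:

```python
# Assumed key cardinalities from the estimate above
distinct_dates = 365   # one year of EVENT_DATETIME::DATE values
distinct_types = 300   # "several hundred" COL_TYPE values

# Figures from the SYSTEM$CLUSTERING_INFORMATION output
total_partitions = 15105
constant_partitions = 10179

key_cardinality = distinct_dates * distinct_types
print(key_cardinality)                                   # 109500 distinct key values
print(round(key_cardinality / total_partitions, 1))      # 7.2 key values per micro-partition
print(round(constant_partitions / total_partitions, 2))  # 0.67: two in three partitions are constant
```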
If you will maintain this table's clustering state with a periodic CTAS with an ORDER BY, this key is fine. If you plan to use the automatic clustering service, you'll want to reduce the cardinality of the key to keep the service from doing a lot of work for a marginal performance benefit.
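If you do go the autoclustering route, a lower-cardinality key would look something like the following sketch: cluster on the date alone, or on the date plus a short prefix of COL_TYPE (the prefix length of 2 here is an illustrative choice, not a recommendation):

ALTER TABLE BIG_TABLE CLUSTER BY (EVENT_DATETIME::DATE);

-- or, to keep some type locality at much lower cardinality:
ALTER TABLE BIG_TABLE CLUSTER BY (EVENT_DATETIME::DATE, SUBSTR(COL_TYPE, 1, 2));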