Reputation: 1
I've got a couple hundred million records across the US clustered on a state/county FIPS code column (3K separate values) followed by a Geography column. Until I added the clustering, spatial joins were timing out after 8 hours, while now they run in a couple minutes. Now I read that queries are supposed to include all clustered fields in order, or the clustering won't provide any benefit. My joins with ST_INTERSECTS only make use of geometries and don't include FIPS code. Can anyone explain why I'm seeing a clustering benefit, even though my queries are not using the clustered fields in clustered field order? Could it be that a geography column added to a BigQuery table's clustered columns can be inserted in any order, and that queries only need to use the order of the non-geography columns to reap benefits?
Upvotes: 0
Views: 496
Reputation: 7764
While most improvements comes from using the clustered column in the query, clustering by a related column helps too.
The clustering makes sure the rows with similar values are stored together. This allows BigQuery to quickly prune large chunks of storage when it can be sure they don't have any values used in the query.
Say you have a clustered column foo
. Clustering makes sure the storage is broken into chunks with continuous foo
values, say (0 - 100), (100 - 200), etc. Now let's consider three cases:
WHERE foo = 42
we know we only need to read first file.WHERE bar = 101
and values of bar
don't correlate with foo
- it does not help at all, we have to read all chunks.bar
is related to foo
, e.g. it always equals to foo + 100
, the same chunks contain continuous ranges of bar
as well! So we have chunks with bar
ranges (100-200), (200-300), ... and query with condition WHERE bar = 101
only needs to read first file.The last case probably explains what happens here: FIPS codes are spatially related, two consecutive FIPS codes are likely to be (but not always) close to each other. So the table is (weakly) spatially clustered too and ST_INTERSECTS
filter works faster. It is likely thought the clustering by actual geography value might give slightly better performance.
Upvotes: 1