Pete Nagy
Pete Nagy

Reputation: 1

BigQuery clustering on geography and other columns

I've got a couple hundred million records across the US clustered on a state/county FIPS code column (3K separate values) followed by a Geography column. Until I added the clustering, spatial joins were timing out after 8 hours, while now they run in a couple minutes. Now I read that queries are supposed to include all clustered fields in order, or the clustering won't provide any benefit. My joins with ST_INTERSECTS only make use of geometries and don't include FIPS code. Can anyone explain why I'm seeing a clustering benefit, even though my queries are not using the clustered fields in clustered field order? Could it be that a geography column added to a BigQuery table's clustered columns can be inserted in any order, and that queries only need to use the order of the non-geography columns to reap benefits?

Upvotes: 0

Views: 496

Answers (1)

Michael Entin
Michael Entin

Reputation: 7764

While most improvements comes from using the clustered column in the query, clustering by a related column helps too.

The clustering makes sure the rows with similar values are stored together. This allows BigQuery to quickly prune large chunks of storage when it can be sure they don't have any values used in the query.

Say you have a clustered column foo. Clustering makes sure the storage is broken into chunks with continuous foo values, say (0 - 100), (100 - 200), etc. Now let's consider three cases:

  • If you query with condition WHERE foo = 42 we know we only need to read first file.
  • If you query with condition WHERE bar = 101 and values of bar don't correlate with foo - it does not help at all, we have to read all chunks.
  • If column bar is related to foo, e.g. it always equals to foo + 100, the same chunks contain continuous ranges of bar as well! So we have chunks with bar ranges (100-200), (200-300), ... and query with condition WHERE bar = 101 only needs to read first file.

The last case probably explains what happens here: FIPS codes are spatially related, two consecutive FIPS codes are likely to be (but not always) close to each other. So the table is (weakly) spatially clustered too and ST_INTERSECTS filter works faster. It is likely thought the clustering by actual geography value might give slightly better performance.

Upvotes: 1

Related Questions