Clustering using DBSCAN in BigQuery

Question

I have a BigQuery table with only one column, named 'point'. It contains location coordinates that I want to cluster using the ST_CLUSTERDBSCAN function in BigQuery.

I use the following query:

SELECT ST_CLUSTERDBSCAN(point, 2000, 200) OVER () AS cluster_num 
FROM mytable

I get this error:

Resources exceeded during query execution: The query could not be executed in the allotted memory. Peak usage: 128% of limit. Top memory consumer(s): analytic OVER() clauses: 97% other/unattributed: 3%

From what I understand, this is because the query is memory intensive. Is there any way I can cluster my data, given that my table contains millions of rows?

Michael Entin · Accepted Answer

Most analytic functions in BigQuery currently process each partition on a single shard (machine), so a partition has to fit in that machine's memory, which limits it to about 1 GB of data. In your query, OVER () means there is no partitioning: all rows are processed as a single partition.

The usual solution is to partition the data at some coarse granularity. For example, if the data has a spatial hierarchy, you can partition by that column, e.g. OVER(PARTITION BY state). Of course, this means there will be no cross-state clusters, so the result is not exactly the same, but if the data clusters naturally along those boundaries this is usually reasonable.
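As a minimal sketch of the partitioned form, assuming the table also carried a `state` column (hypothetical here: the table in the question has only `point`, so such a column would have to be added or joined in first):

```sql
-- Sketch only: `state` is a hypothetical column, not present in the table from the question.
-- Each state is clustered independently, so each partition must fit in memory on its own.
SELECT
  state,
  point,
  ST_CLUSTERDBSCAN(point, 2000, 200) OVER (PARTITION BY state) AS cluster_num
FROM mytable
```

Cluster numbers are assigned independently within each partition, so a cluster is identified by the pair (state, cluster_num) rather than by cluster_num alone.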

If no such intrinsic hierarchy is available, another option is to partition by a short geohash (with very few characters, just as many as needed to avoid the resources exceeded error), something like OVER(PARTITION BY ST_GEOHASH(point, 2)). Another good option is S2_CELLIDFROMPOINT(ST_CENTROID(geo), level => cell_level); see the S2 cell sizes for choosing the cell level.
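Both variants might look roughly like the following for the table in the question. The geohash length of 2 comes from the answer above; the S2 level of 6 is purely an illustrative assumption and should be tuned so that each cell's rows fit in memory. Since `point` already holds point geographies, ST_CENTROID is left out here; it would only be needed for non-point geographies.

```sql
-- Sketch only: partition by a 2-character geohash of each point.
SELECT
  point,
  ST_GEOHASH(point, 2) AS cell,
  ST_CLUSTERDBSCAN(point, 2000, 200)
    OVER (PARTITION BY ST_GEOHASH(point, 2)) AS cluster_num
FROM mytable;

-- Sketch only: partition by S2 cell ID instead. Level 6 is an assumed value, not a recommendation.
SELECT
  point,
  S2_CELLIDFROMPOINT(point, level => 6) AS cell,
  ST_CLUSTERDBSCAN(point, 2000, 200)
    OVER (PARTITION BY S2_CELLIDFROMPOINT(point, level => 6)) AS cluster_num
FROM mytable;
```

As with partitioning by state, clusters cannot span cells, so a dense group of points that straddles a cell boundary may be split into two clusters or fall below the minimum point count on each side. A cluster is identified by (cell, cluster_num).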
