M_K_
M_K_

Reputation: 45

Clustering using DBSCAN in bigquery

I have a Bigquery table with only one column named 'point'. It contains location coordinates that I want to cluster using the ST_CLUSTERDBSCAN function in BigQuery.

I use the following query:

SELECT ST_CLUSTERDBSCAN(point, 2000, 200) OVER () AS cluster_num 
FROM mytable

I get this error:

Resources exceeded during query execution: The query could not be executed in the allotted memory. Peak usage: 128% of limit. Top memory consumer(s): analytic OVER() clauses: 97% other/unattributed: 3%

From what I understand, this is because the query is memory intensive. Is there any way I can use cluster my data given that my table contains millions of rows?

Upvotes: 2

Views: 1968

Answers (2)

John Powell
John Powell

Reputation: 12581

To add to Michael's answer, one of the issues I found was that the S2 grid, being regular, didn't match varying densities on the ground. So, you tend to end up with much smaller partitions than you want, just because you have to set the S2 level appropriate for the most dense area. In the case of the UK, where there are ready made grids at different scale, I created a mixed 100km and 20km grid, see below. The smaller squares cover the larger cities.

Another option would be to use a hex grid, which Carto have made available to BiqQuery via jslibs.h3.

There are numerous other options, such as recursively dividing the space similar to KD-tree construction, until the largest remaining input partition is guaranteed to fit on one shard.

Depending on what one is trying to do, there is then the additional problem of combining clusters which cross whatever subdivision is used in the OVER clause. There are solutions to this, such as using ST_Union and ST_Intersects and merging adjacent clusters, but this is beyond the original question. Ultimately, this is why you want to keep the partitions as large as possible, but it will reduce the amount of work needed to recombine clusters, assuming you want to do this.

UK mixed grid

Upvotes: 2

Michael Entin
Michael Entin

Reputation: 7764

Most analytic functions in BigQuery currently run one partition on a single shard (machine), and thus the partition size is limited in memory to about 1GB data size. In your query, OVER () means there is no partitioning - all data is run in a single partition.

The solution usually is to partition data on some large granularity. E.g. if the data has some spatial hierarchy, you can partition by this column - e.g. do OVER(PARTITION BY state). Of course, it means there will be no cross-state clusters, so the result is not exactly the same, but if there is a natural clustering this is usually reasonable.

If such intrinsic hierarchy is not available, another option is to partition by, say, a short geohash (with very few letters - just as many as needed to avoid the resource exceeded errors), something like OVER(PARTITION BY st_geohash(point, 2)). A good option is S2_CellIDFromPoint(ST_Centroid(geo, level)), see S2 cell sizes for choosing the cell level.

Upvotes: 4

Related Questions