Reputation: 837
From the official docs we read that LEFT/RIGHT/FULL OUTER JOINS are not optimized for spatial data. I have been running several long queries that use complex joins on GEOGRAPHY
data types.
My question is, how does BigQuery deal with spatial data join under the hood? Is everything converted to Geohash?
I have tried clustering my table by a GEOGRAPHY
type column but so far speed improvements have been negligible.
If I use Geohash (STRING) in a where clause for a JOIN instead of a GEOGRAPHY
type does that result in a performance boost?
Here's an example of what I'm talking about:
select t1.Geohash, t1.Name, t1.Way, t1.Long, t1.Lat, t1.CoreInt
, t1.Label, t1.IntLat, t1.IntLong
, row_number() over(partition by Geohash order by Dist) as RowNum
, Distance
from table_name t1
left outer join (select Geohash, Label from table where CoreInt = 1) t2
using (Geohash)
where t2.Label is null
or t1.Label = t2.Label
Thanks
Upvotes: 2
Views: 1000
Reputation: 7744
Yes, BigQuery does not optimize LEFT/RIGHT/OUTER spatial JOINs yet.
For now you would need to convert such joins to INNER JOIN + select mismatched rows, see this question for example: How to JOIN in geography columns using ST_CONTAINS in Big query
Internally BigQuery uses S2 indexing. It might be faster or slower than join on geohash, depending on data. But unlike join on geohash it guarantees correct result.
Joining on Geohash has two major issues:
1) Geohash buckets are not uniform, geohashes of same fixed length describe much bigger real areas near equator than near poles. S2 provides more uniform indexing.
2) Geohash also may miss some pairs that should be joined, when two geographies are close enough but are just across geohash split boundary and thus hash to different values. For example, points just below and just above 45 parallel would have different geohash values, even if they are very close, and one expects them to join.
Upvotes: 8