Reputation: 139
I have two tables, one with 166M address points and one with 170,000 precinct polygons. I want to intersect the points through the polygons to get a table containing the address id and the precinct id for the precinct the address is found in. To avoid issues with precincts that overlap state boundaries (because this is a merged dataset of individual state datasets), I have added a condition that the state in each file be the same.
To improve performance, both input tables have been clustered by their respective geometry fields, an inner join is used, and the input tables are materialized as tables (as per the documentation on using spatial joins). However, the process times out before it finishes.
To troubleshoot, I ensured that the precinct polygons have the expected orientation (which was the issue in this post), and I have also reviewed the BigQuery documentation on optimizing queries
The first thing I tried was a standard inner join:
SELECT
addr.id_address,
prec.id_precinct,
prec.geom
FROM precincts AS prec
INNER JOIN addresses AS addr
ON prec.geo_state = addr.geo_state
AND ST_INTERSECTS(addr.geom, prec.geom)
I have also tried using a window function:
SELECT *
FROM (
SELECT
addr.id_address,
prec.id_precinct,
prec.geom,
ROW_NUMBER() OVER(PARTITION BY addr.geo_state) AS rn
FROM addresses AS addr
INNER JOIN precincts AS prec
ON prec.geo_state = addr.geo_state
AND ST_INTERSECTS(addr.geom, prec.geom)
) AS ranked
WHERE rn = 1
In both cases, it times out before finishing. Any help on how to optimize this query is much appreciated. Note that I am using dbt
to run this, and I have my timeout set to 1800 sec (it times out when running it in the BigQuery UI, though, too).
Upvotes: 1
Views: 832
Reputation: 7744
If the query runs fast enough when you remove the state equality condition prec.geo_state = addr.geo_state
, the solution is to remove the condition and filter the result with a separate query, or turn it into something where both sides of =
depend on left and right join children, e.g.
CONCAT(prec.geo_state, ".", addr.geo_state) = CONCAT(addr.geo_state, ".", prec.geo_state)
Explanation: BigQuery does not yet support efficient mixing of different types of JOIN conditions. If it sees both an equality condition and a geospatial condition in a single join, it "prefers" the equality join, executing the geospatial predicate as a regular filter condition, instead of building and using geospatial indices.
That's why this trick often improves performance: it "hides" the equality condition, makes BigQuery use geospatial join, and only then check the now-complex equality condition. Of course, it only works if the query without state equality predicate runs reasonably fast.
Upvotes: 2
Reputation: 105
Michael, I've read the article and this post was hoping for some clarification on how to apply this. I have this
SELECT columns
FROM a
JOIN b
ON a.county_fips IN UNNEST(b.county_fips_list)
AND ST_INTERSECTS(a.geometry, b.geopoint)
WHERE poi.client_id IN UNNEST(partner_list)
The county_fips are string types and county_fips_list is an array of strings.
How would suggest I optimize this? This just a part of a larger query but have isolated this is where it bottle necks
Should I move the county fips outside of the join to the where clause beneath? Thank you in advance!
Upvotes: 1