Francois BAPTISTE
Francois BAPTISTE

Reputation: 163

How can I avoid drastic decrease in performance when packaging algorithms into a Bigquery SQL UDFs?

I have notice that running the same algorithm can take much longer (up to x60) when running in a Bigquery UDF than when running directly.

The two code snippets below that illustrate the problem:

The first one complete in 1.6 sec

WITH T_ AS (
    SELECT road_geom g
    FROM `bigquery-public-data.geo_us_roads.all_roads_04`
    ORDER BY ST_LENGTH(road_geom) desc
    LIMIT 1
  ),
  T0 AS (
    SELECT ST_PointN(g,idx) point, idx
    FROM T_, UNNEST(GENERATE_ARRAY(1, ST_NPOINTS(g))) AS idx
  )
SELECT point
FROM T0
ORDER BY ST_LENGTH(ST_MAKELINE(ARRAY_AGG(point) OVER ( ORDER BY idx ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING))) DESC
LIMIT 1

The second finished in 1 min 47 sec

CREATE TEMP FUNCTION
  myfunction(g GEOGRAPHY) AS ((
    WITH
        T0 AS (
    SELECT ST_PointN(g,idx) point, idx
    FROM UNNEST(GENERATE_ARRAY(1, ST_NPOINTS(g))) AS idx
  )
SELECT point
FROM T0
ORDER BY ST_LENGTH(ST_MAKELINE(ARRAY_AGG(point) OVER ( ORDER BY idx ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING))) DESC
LIMIT 1));
      
      
SELECT myfunction(road_geom)
FROM `bigquery-public-data.geo_us_roads.all_roads_04`
ORDER BY ST_LENGTH(road_geom) DESC
LIMIT 1

Upvotes: 0

Views: 151

Answers (1)

Michael Entin
Michael Entin

Reputation: 7744

These seems to be somewhat different queries.

First one computes a single row in T_ by taking the longest road, then uses it in T0 computation to extract a list of points, and then computes longest segment.

Second one computes this for every road, and only then takes a single result by taking the longest road. I guess the optimizer is not quite smart enough to rewrite the second query.

I suggest writing the last query as following, which makes it equally fast:

SELECT myfunction(road_geom)
FROM (
    SELECT * 
    FROM `bigquery-public-data.geo_us_roads.all_roads_04`
    ORDER BY ST_LENGTH(road_geom) DESC
    LIMIT 1
)

Upvotes: 2

Related Questions