Jesse McCrosky
Jesse McCrosky

Reputation: 11

Helping BigQuery parallelize more effectively

I am attempting to run a large BigQuery query (pasted below) to generate a new derived table. The source table has three array columns and I want to calculate the median of one array, the mode of another, and the value counts for the third. The source table has 160k rows and the arrays are large. The query can not complete in six hours.

However, if I split the query into multiple queries, the smaller queries are much faster. The query is simple applying a transformation to each row of the source table to produce a row of the output table, so it is simple to split the query up into multiple queries, each of which processes one or more rows.

This solution can work for me (although I need to find out how to increase my simultaneous query limit beyond 100), but it seems that BigQuery should be able to do this sort of parallelization under the hood. Is there some functionality that allows me to help BigQuery parallelize appropriately? It seems what I'm doing is not overwhelmingly large and, with appropriate parallelization, BigQuery should be able to handle it in less than six hours.

CREATE OR REPLACE TABLE derived.carbontest400agg AS
SELECT
  cell,
  minx,
  maxx,
  x,
  y,
  PERCENTILE_CONT(ndvi, 0.5) OVER(PARTITION BY x, y) AS median_ndvi,
  (
    SELECT
      mode
    FROM (
      SELECT
        ROW_NUMBER() OVER(ORDER BY c DESC) AS rn,
        cz AS mode
      FROM (
        SELECT
          cz,
          COUNT(cz) AS c
        FROM
          UNNEST(cz_values) cz
        GROUP BY
          cz
      )
     )
     WHERE rn = 1
   ) AS cz_mode,
   (SELECT ARRAY(
       SELECT AS STRUCT
         lc,
         COUNT(lc) AS count
       FROM
         UNNEST(lc_values) AS lc
       GROUP BY
         lc
     ) AS lc_counts) AS lc_counts
FROM
  `abiding-center-290412.derived.carbontest400`
CROSS JOIN
  UNNEST(ndvi_values) AS ndvi

Upvotes: 1

Views: 792

Answers (1)

Sergey Geron
Sergey Geron

Reputation: 10222

Maybe removing cross join will help:

CREATE OR REPLACE TABLE derived.carbontest400agg AS
SELECT
  cell,
  minx,
  maxx,
  x,
  y,
  (SELECT PERCENTILE_CONT(ndvi, 0.5) OVER() FROM UNNEST(ndvi_values) AS ndvi LIMIT 1) AS median_ndvi,
  (
    SELECT
      mode
    FROM (
      SELECT
        ROW_NUMBER() OVER(ORDER BY c DESC) AS rn,
        cz AS mode
      FROM (
        SELECT
          cz,
          COUNT(cz) AS c
        FROM
          UNNEST(cz_values) cz
        GROUP BY
          cz
      )
     )
     WHERE rn = 1
   ) AS cz_mode,
   (SELECT ARRAY(
       SELECT AS STRUCT
         lc,
         COUNT(lc) AS count
       FROM
         UNNEST(lc_values) AS lc
       GROUP BY
         lc
     ) AS lc_counts) AS lc_counts
FROM
  `abiding-center-290412.derived.carbontest400`
  

Upvotes: 0

Related Questions