Reputation: 11
I am attempting to run a large BigQuery query (pasted below) to generate a new derived table. The source table has three array columns and I want to calculate the median of one array, the mode of another, and the value counts for the third. The source table has 160k rows and the arrays are large. The query can not complete in six hours.
However, if I split the query into multiple queries, the smaller queries are much faster. The query is simple applying a transformation to each row of the source table to produce a row of the output table, so it is simple to split the query up into multiple queries, each of which processes one or more rows.
This solution can work for me (although I need to find out how to increase my simultaneous query limit beyond 100), but it seems that BigQuery should be able to do this sort of parallelization under the hood. Is there some functionality that allows me to help BigQuery parallelize appropriately? It seems what I'm doing is not overwhelmingly large and, with appropriate parallelization, BigQuery should be able to handle it in less than six hours.
CREATE OR REPLACE TABLE derived.carbontest400agg AS
SELECT
cell,
minx,
maxx,
x,
y,
PERCENTILE_CONT(ndvi, 0.5) OVER(PARTITION BY x, y) AS median_ndvi,
(
SELECT
mode
FROM (
SELECT
ROW_NUMBER() OVER(ORDER BY c DESC) AS rn,
cz AS mode
FROM (
SELECT
cz,
COUNT(cz) AS c
FROM
UNNEST(cz_values) cz
GROUP BY
cz
)
)
WHERE rn = 1
) AS cz_mode,
(SELECT ARRAY(
SELECT AS STRUCT
lc,
COUNT(lc) AS count
FROM
UNNEST(lc_values) AS lc
GROUP BY
lc
) AS lc_counts) AS lc_counts
FROM
`abiding-center-290412.derived.carbontest400`
CROSS JOIN
UNNEST(ndvi_values) AS ndvi
Upvotes: 1
Views: 792
Reputation: 10222
Maybe removing cross join
will help:
CREATE OR REPLACE TABLE derived.carbontest400agg AS
SELECT
cell,
minx,
maxx,
x,
y,
(SELECT PERCENTILE_CONT(ndvi, 0.5) OVER() FROM UNNEST(ndvi_values) AS ndvi LIMIT 1) AS median_ndvi,
(
SELECT
mode
FROM (
SELECT
ROW_NUMBER() OVER(ORDER BY c DESC) AS rn,
cz AS mode
FROM (
SELECT
cz,
COUNT(cz) AS c
FROM
UNNEST(cz_values) cz
GROUP BY
cz
)
)
WHERE rn = 1
) AS cz_mode,
(SELECT ARRAY(
SELECT AS STRUCT
lc,
COUNT(lc) AS count
FROM
UNNEST(lc_values) AS lc
GROUP BY
lc
) AS lc_counts) AS lc_counts
FROM
`abiding-center-290412.derived.carbontest400`
Upvotes: 0