Zachary Vance

Reputation: 862

How can I make this query run efficiently?

In BigQuery, we're trying to run:

SELECT day, AVG(value)/(1024*1024) FROM ( 
    SELECT value, UTC_USEC_TO_DAY(timestamp) as day, 
         PERCENTILE_RANK() OVER (PARTITION BY day ORDER BY value ASC) as rank 
    FROM [Datastore.PerformanceDatum]
    WHERE type = "MemoryPerf"
) WHERE rank >= 0.9 AND rank <= 0.91 
GROUP BY day 
ORDER BY day desc;

which returns a relatively small amount of data. But we're getting the message:

Error: Resources exceeded during query execution. The query contained a GROUP BY operator, consider using GROUP EACH BY instead. For more details, please see https://developers.google.com/bigquery/docs/query-reference#groupby

What is making this query fail? Is it the size of the subquery? Is there some equivalent query we can run that avoids the problem?


Edit in response to comments: If I add GROUP EACH BY (and drop the outer ORDER BY), the query fails, this time claiming that GROUP EACH BY is not parallelizable here.
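That is, this variant also fails:

SELECT day, AVG(value)/(1024*1024) FROM ( 
    SELECT value, UTC_USEC_TO_DAY(timestamp) as day, 
         PERCENTILE_RANK() OVER (PARTITION BY day ORDER BY value ASC) as rank 
    FROM [Datastore.PerformanceDatum]
    WHERE type = "MemoryPerf"
) WHERE rank >= 0.9 AND rank <= 0.91 
GROUP EACH BY day;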

Upvotes: 3

Views: 335

Answers (1)

Felipe Hoffa

Reputation: 59175

I wrote an equivalent query that works for me:

SELECT day, AVG(value)/(1024*1024) FROM (
    SELECT data value, UTC_USEC_TO_DAY(dtimestamp) as day, 
         PERCENTILE_RANK() OVER (PARTITION BY day ORDER BY value ASC) as rank 
    FROM [io_sensor_data.moscone_io13]
    WHERE sensortype = "humidity"
) WHERE rank >= 0.9 AND rank <= 0.91 
GROUP BY day 
ORDER BY day desc;

If I run only the inner query, I get 3,660,624 results. Is your dataset bigger than that?
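To check, you can count the rows that feed your inner query (a sketch, using the table and filter from your question):

SELECT COUNT(*)
FROM [Datastore.PerformanceDatum]
WHERE type = "MemoryPerf";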

The outer select gives me only 4 results when grouped by day. I'll try a different grouping to see if I can hit a limit there:

SELECT day, AVG(value)/(1024*1024) FROM (
    SELECT data value, dtimestamp / 1000 as day, 
         PERCENTILE_RANK() OVER (PARTITION BY day ORDER BY value ASC) as rank 
    FROM [io_sensor_data.moscone_io13]
    WHERE sensortype = "humidity"
) WHERE rank >= 0.9 AND rank <= 0.91 
GROUP BY day 
ORDER BY day desc;

This one runs too, now with 57,862 different groups.

I tried different combinations to reproduce the error, and I was able to trigger it by doubling the amount of initial data. An easy "hack" to double the data is changing:

    FROM [io_sensor_data.moscone_io13]

To:

    FROM [io_sensor_data.moscone_io13], [io_sensor_data.moscone_io13]

Then I get the same error. How much data do you have? Can you apply an additional filter? Since you are already partitioning the percentile_rank by day, can you add a filter to only analyze a fraction of the days (for example, only the last month, as sketched below)?
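For example, a sketch of that last suggestion against the table from your question, using PARSE_UTC_USEC with a hypothetical cutoff date (adjust it to your data):

SELECT day, AVG(value)/(1024*1024) FROM ( 
    SELECT value, UTC_USEC_TO_DAY(timestamp) as day, 
         PERCENTILE_RANK() OVER (PARTITION BY day ORDER BY value ASC) as rank 
    FROM [Datastore.PerformanceDatum]
    -- hypothetical cutoff: keep only roughly the last month of data
    WHERE type = "MemoryPerf"
      AND timestamp >= PARSE_UTC_USEC('2013-10-01 00:00:00')
) WHERE rank >= 0.9 AND rank <= 0.91 
GROUP BY day 
ORDER BY day desc;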

Upvotes: 1
