Resources exceeded during query execution BigQuery when using LAST_VALUE() OVER()

I have a quite long table that is 2.47GB big and has 112.6 million rows. The table has 8 columns, being the first one of TIMESTAMP type and the other 7 FLOAT type. The table is partiotioned (day) and clustered by datetime.

The table looks like this:

datetime                    col1     col2     col3    col4    col5    col6     col7
2020-07-29 07:19:24.920 UTC null     null     null    null    null    null     0.01 
2020-07-29 16:41:53.094 UTC 1.3344   null     null    0.6975  null    null     0.01 
2020-07-29 08:05:23.705 UTC null     1.698    null    null    null    null     0.01 
2020-07-29 18:12:21.396 UTC 1.33435  null     null    null    null    null     0.01 
2020-07-29 19:49:26.073 UTC null     1.12345  null    null 1.33435    null     0.01 
2020-07-29 19:33:21.540 UTC null     null     null    null 1.33377    null     0.01
2020-07-29 04:11:24.596 UTC 1.256    null     null    null 1.33694    null     0.01     
2020-07-29 09:27:05.052 UTC null     null     null    0.6868  null    104.889  0.01

What I'm trying to do is to populate other table with the same values of this table but performing a fowrward fill of null values, which I was reading in docs that can be achieved using LAST_VALUE function.

The query I'm trying to perform uses LAST_VALUE with OVER for all columns except for the last one.

INSERT project.dataset.table (datetime, col1,col2,col3,col4,col5,col6)

WITH current AS(
SELECT
datetime,
   LAST_VALUE(col1 IGNORE NULLS) OVER (ORDER BY datetime) AS col1,
   
   LAST_VALUE(col2 IGNORE NULLS) OVER (ORDER BY datetime) AS col2,
   
   LAST_VALUE(col3 IGNORE NULLS) OVER (ORDER BY datetime) AS col3,
   
   LAST_VALUE(col4 IGNORE NULLS) OVER (ORDER BY datetime) AS col4,
   
   LAST_VALUE(col5 IGNORE NULLS) OVER (ORDER BY datetime) AS col5,
   
   LAST_VALUE(col6 IGNORE NULLS) OVER (ORDER BY datetime) AS col6,
   
FROM
  project.dataset.origin_table

ORDER BY
  datetime)
  select * from current

However, running this query returns this error:

Resources exceeded during query execution: The query could not be executed in the allotted memory. Peak usage: 100% of limit. Top memory consumer(s): sort operations used for analytic OVER() clauses: 97% other/unattributed: 3%

So, for reading the error message, its clear that OVER is the issue, unless I'm missing something or I'm not understanding something.

How can I achieve this? The desired output would be like this:

datetime                    col1     col2     col3    col4      col5    col6     col7
2020-07-29 07:19:24.920 UTC null     null     null    null      null    null     0.01   
2020-07-29 16:41:53.094 UTC 1.3344   null     null    0.6975    null    null     0.01   
2020-07-29 08:05:23.705 UTC 1.3344   1.698    null    0.6975    null    null     0.01   
2020-07-29 18:12:21.396 UTC 1.33435  1.698    null    0.6975    null    null     0.01   
2020-07-29 19:49:26.073 UTC 1.33435  1.12345  null    0.6975   1.33435  null     0.01   
2020-07-29 19:33:21.540 UTC 1.33435  1.12345  null    0.6975   1.33377  null     0.01
2020-07-29 04:11:24.596 UTC 1.256    1.12345  null    0.6975   1.33694  null     0.01   
2020-07-29 09:27:05.052 UTC 1.256    1.12345  null    0.6868   1.33694  104.889  0.01

To fill null values with the last value encountered in each column.

Thank you!

Upvotes: 2

Views: 725

Answers (2)

Sarrión
Sarrión

Reputation: 77

The error message you're seeing refers to the sorting part of your OVER clause, i.e., "ORDER BY".

When using analytic functions, all the data in each of the final partitions need to be handled by the same worker. When the amount of data to be handled, and in this case, sorted, is really high, sometimes it can be allocated in memory and fails. At the moment there's no option to increase the memory on these situations, only workarounds.

Try to work with only one column in a first stage, and then paste all the columns in a single table.

Upvotes: 0

Gordon Linoff
Gordon Linoff

Reputation: 1270713

The problem isn't specifically LAST_VALUE(). The problem is the lack of PARTITION BY in the window specification. You can validate this with a little trick, which would be to introduce the date as a partitioning element.

So, does this work?

LAST_VALUE (col1 IGNORE NULLS) OVER (PARTITION BY DATETIME_TRUNC(datetime, day) ORDER BY datetime) AS col1,

I am not saying that this will return the results you want, but it should fix the resources exceeded problem.

Then . . . If might be that you have a natural PARTITION BY that should be used. If so, use it. If not, there may be other ways to get the results you want.

Upvotes: 0

Related Questions