yan-hic
yan-hic

Reputation: 1544

BigQuery: LAG() on UNNEST()

I was looking at using LAG() to return a value from the previous row in an ARRAY of STRUCT, and it seems like the navigation window functions like LAG/LEAD are not supported with UNNEST.

It sounds suboptimal but you need to UNNEST all and (re)PARTITION BY !!

E.g. say you want the amount of the previous line:

EDIT: Schema details:

Invoices schema : docID STRING, lines STRUCT<lineID STRING, amount NUMERIC>

Expected output: docID STRING, lines STRUCT<lineID STRING, amount NUMERIC, prev_amount NUMERIC>

SELECT 
   docID, lineID, amount
   ,LAG(amount) OVER (PARTITION BY docID ORDER by lineID) prev_amount
FROM invoices, UNNEST(lines)

This is pretty slow. I wish we could somehow pass the ARRAY as window frame to use.

Any alternative ? I am using BigQuery but this may apply to other DBMS.

UPDATE: I have come up with below alternative using WITH OFFSET. Faster and maybe this is what OFFSET is meant to be used for, it's just ugly and could be simpler in my mind.

SELECT 
   docID, lineID, amount
   ,(SELECT amount FROM UNNEST(lines) WITH OFFSET AS pos2 WHERE pos2 = pos - 1) prev_amount
FROM invoices, UNNEST(lines) WITH OFFSET AS pos

Upvotes: 1

Views: 2035

Answers (2)

yan-hic
yan-hic

Reputation: 1544

Below alternative to Mikhail's answer, using WITH OFFSET instead, performs better (55s vs. 170s, over 50M rows)

#standardSQL
SELECT docID, ARRAY(
    SELECT AS STRUCT 
      l.*, 
      lines[SAFE_OFFSET(pos - 1)].amount prev_amount 
    FROM UNNEST(lines) l WITH OFFSET pos
    ) lines
FROM invoices

Note: this will not work as-is if ORDER BY or a WHERE clause is added to the unnesting that changes the original indexing.

This was my use case, and if you are after relative positioning in an filtered array, you should use a subquery to create the filtered array first, and then the combination UNNEST WITH OFFSET + OFFSET/ORDINAL in the outer query.

Upvotes: 1

Mikhail Berlyant
Mikhail Berlyant

Reputation: 172944

Below shows better performance (at least on small amount that I run it against for quick testing). Also execution plan looks much better (to me at least)

#standardSQL
SELECT 
   lines[OFFSET(pos)].*, lines[SAFE_OFFSET(pos - 1)].amount prev_amount
FROM invoices, UNNEST(GENERATE_ARRAY(0, ARRAY_LENGTH(lines) - 1)) pos

Note: I was assuming following schema [STRUCT<docID INT64, lineID INT64, amount FLOAT64>] lines

UPDATE: Adjustment to match to real schema you just provided

#standardSQL
SELECT docID, ARRAY(
    SELECT AS STRUCT 
      lines[OFFSET(pos)].*, 
      lines[SAFE_OFFSET(pos - 1)].amount prev_amount 
    FROM UNNEST(GENERATE_ARRAY(0, ARRAY_LENGTH(lines) - 1)) pos
    ) lines
FROM invoices

Upvotes: 1

Related Questions