Dany Majard
Dany Majard

Reputation: 103

Get the first row of a nested field in BigQuery

I have been struggling with a question that seem simple, yet eludes me. I am dealing with the public BigQuery table on bitcoin and I would like to extract the first transaction of each block that was mined. In other word, to replace a nested field by its first row, as it appears in the table preview. There is no field that can identify it, only the order in which it was stored in the table.

I ran the following query:

#StandardSQL
SELECT timestamp,
    block_id,
    FIRST_VALUE(transactions) OVER (ORDER BY (SELECT 1))
FROM `bigquery-public-data.bitcoin_blockchain.blocks`

But it process 492 GB when run and throws the following error:

Error: Resources exceeded during query execution: The query could not be executed in the allotted memory. Sort operator used for OVER(ORDER BY) used too much memory..

It seems so simple, I must be missing something. Do you have an idea about how to handle such task?

Upvotes: 2

Views: 5834

Answers (2)

Felipe Hoffa
Felipe Hoffa

Reputation: 59175

An alternative approach to Mikhail's: Just ask for the first row of an array with [OFFSET(0)]:

#StandardSQL
SELECT timestamp,
    block_id,
    transactions[OFFSET(0)] first_transaction
FROM `bigquery-public-data.bitcoin_blockchain.blocks`
LIMIT 10

That first row from the array still has some nested data, that you might want to flatten to only their first row too:

#standardSQL
SELECT timestamp
    , block_id
    , transactions[OFFSET(0)].transaction_id first_transaction_id
    , transactions[OFFSET(0)].inputs[OFFSET(0)] first_transaction_first_input
    , transactions[OFFSET(0)].outputs[OFFSET(0)] first_transaction_first_output
FROM `bigquery-public-data.bitcoin_blockchain.blocks`
LIMIT 1000

Upvotes: 3

Mikhail Berlyant
Mikhail Berlyant

Reputation: 172993

#standardSQL
SELECT * EXCEPT(transactions),
  (SELECT transaction FROM UNNEST(transactions) transaction LIMIT 1) transaction
FROM `bigquery-public-data.bitcoin_blockchain.blocks`    

Recommendation: while playing with large table like this one - I would recommend creating smaller version of it - so it incur less cost for your dev/test. Below can help with this - you can run it in BigQuery UI with destination table which you will then be using for your dev. Make sure you set Allow Large Results and unset Flatten Results so you preserve original schema

#legacySQL
SELECT *
FROM [bigquery-public-data:bitcoin_blockchain.blocks@1529518619028]     

The value of 1529518619028 is taken from below query (at a time of running) - the reason I took four days ago is that I know number of rows in this table that time was just 912 vs current 528,858

#legacySQL
SELECT INTEGER(DATE_ADD(USEC_TO_TIMESTAMP(NOW()), -24*4, 'HOUR')/1000) 

Upvotes: 5

Related Questions