smaica
smaica

Reputation: 817

Google Bigquery Memory error when using ROW_NUMBER() on large table - ways to replace long hash by short unique identifier

For a query in google BigQuery I want to replace a long hash by a shorter numeric unique identifier to save some memory afterwards, so I do:

SELECT 
    my_hash
    , ROW_NUMBER() OVER (ORDER BY null) AS id_numeric
FROM hash_table_raw
GROUP BY my_hash

I don't even need an order in the id, but ROW_NUMBER() requires an ORDER BY.

When I try this on my dataset (> 1 billion rows) I get a memory error:

400 Resources exceeded during query execution: The query could not be executed in the allotted memory. Peak usage: 126% of limit.
Top memory consumer(s):
  sort operations used for analytic OVER() clauses: 99%
  other/unattributed: 1%

Is there another way to replace a hash by an shorter identifier?

Thanks!

Upvotes: 0

Views: 264

Answers (1)

Pratik Patil
Pratik Patil

Reputation: 832

One does not really need to have populated over clause while doing this.

e.g. following will work:

select col, row_number() over() as row_num from (select 'A' as col)

So that will be your first try.

Now, with billion+ rows that you have: if above fails: you can do something like this (considering order is not at all important for you): but here you have to do it in parts:

SELECT 
    my_hash
    , ROW_NUMBER() OVER () AS id_numeric
FROM hash_table_raw
where MOD(my_hash, 5) = 0

And in subsequent queries: you can get max(id_numeric) from previous run and add that as an offset to next:

SELECT 
    my_hash
    , previous_max_id_numberic_val + ROW_NUMBER() OVER () AS id_numeric
FROM hash_table_raw
where MOD(my_hash, 5) = 1

And keep appending outputs of these mod queries (0-4) to a single new table.

Upvotes: 2

Related Questions