Reputation: 817
For a query in google BigQuery I want to replace a long hash by a shorter numeric unique identifier to save some memory afterwards, so I do:
SELECT
my_hash
, ROW_NUMBER() OVER (ORDER BY null) AS id_numeric
FROM hash_table_raw
GROUP BY my_hash
I don't even need an order in the id, but ROW_NUMBER()
requires an ORDER BY
.
When I try this on my dataset (> 1 billion rows) I get a memory error:
400 Resources exceeded during query execution: The query could not be executed in the allotted memory. Peak usage: 126% of limit.
Top memory consumer(s):
sort operations used for analytic OVER() clauses: 99%
other/unattributed: 1%
Is there another way to replace a hash by an shorter identifier?
Thanks!
Upvotes: 0
Views: 264
Reputation: 832
One does not really need to have populated over clause while doing this.
e.g. following will work:
select col, row_number() over() as row_num from (select 'A' as col)
So that will be your first try.
Now, with billion+ rows that you have: if above fails: you can do something like this (considering order is not at all important for you): but here you have to do it in parts:
SELECT
my_hash
, ROW_NUMBER() OVER () AS id_numeric
FROM hash_table_raw
where MOD(my_hash, 5) = 0
And in subsequent queries: you can get max(id_numeric) from previous run and add that as an offset to next:
SELECT
my_hash
, previous_max_id_numberic_val + ROW_NUMBER() OVER () AS id_numeric
FROM hash_table_raw
where MOD(my_hash, 5) = 1
And keep appending outputs of these mod queries (0-4) to a single new table.
Upvotes: 2