Reputation: 327
Apparently there is a memory leak in BigQuery's UDFs. We run a simple UDF over a small table (3,000 rows, 5 MB) and it fails. If we run the same UDF over the first half of the table concatenated with the second half of the table (in the same query), then it works! Namely:
SELECT blah FROM myUDF(SELECT id, data FROM table)
fails.
SELECT blah FROM myUDF(SELECT id, data FROM table ORDER BY id LIMIT 1500), myUDF(SELECT id, data FROM table ORDER BY id DESC LIMIT 1500)
succeeds.
The question is: how do we work around this issue? Is there a way to dynamically split a table into multiple parts, each of equal size with a predefined number of rows? Say 1,000 rows at a time? (The sample table has 3,000 rows, but we want this to succeed on larger tables, and if we split a 6,000-row table in half, the UDF will fail again on each half.)
In any solution, it is important to (a) NOT use ORDER BY, because it has a 65,000-row limitation; and (b) use a single combined query (otherwise the solution may be too slow, plus every combined table is charged at a minimum of 10 MB, so if we have to split a 1,000,000-row table into 1,000-row chunks we will automatically be charged for 10 GB. Times 1,000 tables = 10 TB. This stuff adds up quickly).
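One direction we considered is bucketing rows by a hash of id instead of sorting (just a sketch; it assumes legacy SQL's HASH() and an arbitrary bucket count of 3, and the buckets come out only roughly equal in size rather than an exact predefined row count):
SELECT blah
FROM
  -- bucket 0 of 3: rows whose hashed id lands in this bucket
  myUDF(SELECT id, data FROM table WHERE ABS(HASH(id)) % 3 = 0),
  -- bucket 1 of 3
  myUDF(SELECT id, data FROM table WHERE ABS(HASH(id)) % 3 = 1),
  -- bucket 2 of 3
  myUDF(SELECT id, data FROM table WHERE ABS(HASH(id)) % 3 = 2)
Since the comma between the myUDF calls is legacy SQL's UNION ALL, this stays a single combined query and avoids ORDER BY entirely; what we don't see is how to cap each bucket at a fixed number of rows.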
Any ideas?
Upvotes: 1
Views: 571
Reputation: 488
This issue was related to limits we had on the size of the UDF code. It looks like V8's optimize+recompile pass of the UDF code generated a data segment that was bigger than our limits, but this only happened when the UDF ran over a "sufficient" number of rows. I'm meeting with the V8 team this week to dig into the details further.
In the meantime, we've rolled out a fix to update the max data segment size. I've verified that this fixes several other queries that were failing for the same reason.
Could you please retry your query? I'm afraid I can't easily get to your code resources in GCS.
Upvotes: 1