Reputation: 2144
I have a long query written in standard SQL dialect where I join multiple tables together. I had performance issue and my query was taking at least 9.5 seconds to run. I turned off caching and began tweaking the query to make it faster, but only saw marginal progress.
I had a User Defined Function (UDF) defined at the top of the query to handle division by 0 as follows:
CREATE TEMPORARY FUNCTION divide(numerator FLOAT64, denominator FLOAT64)
RETURNS STRING
LANGUAGE js AS """
return (denominator === 0) ? "Not Available" : (100 * (numerator/denominator)) + "%";
""";
I noticed that I was not calling the UDF anywhere in the query anymore, so I removed it. To my surprise the query ran much faster after the removal. I ran it a bunch of times and it never took more than 4.5 seconds.
Why does just creating a UDF add so much to the run time of the query?
Upvotes: 1
Views: 2346
Reputation: 14956
This may not have been possible at the time, but BigQuery also supports a native SAFE_DIVIDE
function which handles the unsafe division for you.
You can wrap in COALESCE if you want to provide a fallback, e.g.
SELECT
COALESCE(
(100 * SAFE_DIVIDE(a, b)) || '%',
'Not available'
) AS whatever,
FROM wherever
And you can wrap that in a single SQL UDF if it needs to be shared across multiple locations, not just this query.
Upvotes: 0
Reputation: 33765
When a user-defined JavaScript function is present in the query text, BigQuery initializes a JavaScript environment with the function's contents on every shard of execution. There is (at the time of this writing) no optimization to avoid loading the environment if the function is not referenced, since the expectation is that if there is a JavaScript UDF present, the intent is probably to use it. The discrepancy that you are seeing is due to the start-up time of the JavaScript environment.
With SQL UDFs, however, the story is different. While BigQuery still has to parse the SQL UDFs regardless of whether you use them in order to figure out where the actual query starts, there is minimal overhead associated with that.
Upvotes: 3