alamoot
alamoot

Reputation: 2144

Big Query User Defined Function dramatically slows down the query

I have a long query written in standard SQL dialect where I join multiple tables together. I had performance issue and my query was taking at least 9.5 seconds to run. I turned off caching and began tweaking the query to make it faster, but only saw marginal progress.

I had a User Defined Function (UDF) defined at the top of the query to handle division by 0 as follows:

CREATE TEMPORARY FUNCTION divide(numerator FLOAT64, denominator FLOAT64)
RETURNS STRING
LANGUAGE js AS """
  return (denominator === 0) ? "Not Available" : (100 * (numerator/denominator)) + "%";
""";

I noticed that I was not calling the UDF anywhere in the query anymore, so I removed it. To my surprise the query ran much faster after the removal. I ran it a bunch of times and it never took more than 4.5 seconds.

Why does just creating a UDF add so much to the run time of the query?

Upvotes: 1

Views: 2346

Answers (2)

El Yobo
El Yobo

Reputation: 14956

This may not have been possible at the time, but BigQuery also supports a native SAFE_DIVIDE function which handles the unsafe division for you.

You can wrap in COALESCE if you want to provide a fallback, e.g.

SELECT
  COALESCE(
    (100 * SAFE_DIVIDE(a, b)) || '%',
    'Not available'
  ) AS whatever,
FROM wherever

And you can wrap that in a single SQL UDF if it needs to be shared across multiple locations, not just this query.

Upvotes: 0

Elliott Brossard
Elliott Brossard

Reputation: 33765

When a user-defined JavaScript function is present in the query text, BigQuery initializes a JavaScript environment with the function's contents on every shard of execution. There is (at the time of this writing) no optimization to avoid loading the environment if the function is not referenced, since the expectation is that if there is a JavaScript UDF present, the intent is probably to use it. The discrepancy that you are seeing is due to the start-up time of the JavaScript environment.

With SQL UDFs, however, the story is different. While BigQuery still has to parse the SQL UDFs regardless of whether you use them in order to figure out where the actual query starts, there is minimal overhead associated with that.

Upvotes: 3

Related Questions