Chaoming Li

Reputation: 235

BigQuery "NOT IN" comparison performance too slow

Has something changed with the "NOT IN" comparison? Its performance is much worse than it was a month ago.

I have a query like this:

SELECT SOMETHING FROM X WHERE KEY NOT IN (SELECT KEY FROM Y)

Y returns 45,000 keys.

X contains 84,000 records.

This query takes more than 1 minute to complete, while the same query using IN takes only a few seconds.

The actual query is more complex than this, but I removed the complex parts and narrowed the cause down to the "NOT IN" comparison.

I ran this query back in August on a much larger dataset and it was not this slow. Has anything changed in the "NOT IN" operation, and is there a workaround to improve the performance?


Upvotes: 3

Views: 945

Answers (1)

Felipe Hoffa

Reputation: 59165

These 3 queries should behave in a similar way (same results and similar performance), but the NOT IN and NOT EXISTS versions are currently far slower than the LEFT JOIN version.

I filed a bug to track this performance hit, as it should only be temporary (https://issuetracker.google.com/issues/116839201).

SELECT tags, COUNT(*) c, ANY_VALUE(b.value)
FROM `bigquery-public-data.stackoverflow.posts_questions` a
LEFT JOIN (SELECT x.value FROM UNNEST((
  SELECT APPROX_TOP_COUNT(tags, 10000) 
  FROM `bigquery-public-data.stackoverflow.posts_questions` 
)) x ) b
ON a.tags=b.value
WHERE b.value IS NULL
GROUP BY 1
ORDER BY 2 DESC
LIMIT 1000

12 seconds, fh-bigquery:US.bquijob_3c0fdf82_1661f6f3dd1

SELECT tags, COUNT(*) c
FROM `bigquery-public-data.stackoverflow.posts_questions` 
WHERE tags NOT IN(SELECT x.value FROM UNNEST((
  SELECT APPROX_TOP_COUNT(tags, 10000) 
  FROM `bigquery-public-data.stackoverflow.posts_questions` 
)) x)
GROUP BY 1
ORDER BY 2 DESC, 1
LIMIT 1000

> 400 seconds, fh-bigquery:US.bquijob_766cc8ab_1661f7023bb

SELECT tags, COUNT(*) c
FROM `bigquery-public-data.stackoverflow.posts_questions` 
WHERE NOT EXISTS(SELECT x.value FROM UNNEST((
  SELECT APPROX_TOP_COUNT(tags, 10000) 
  FROM `bigquery-public-data.stackoverflow.posts_questions` 
)) x WHERE tags=value)
GROUP BY 1
ORDER BY 2 DESC, 1
LIMIT 1000

> 400 seconds, fh-bigquery:US.bquijob_59a9d1e6_1661f59db40
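
As a workaround, the LEFT JOIN ... IS NULL pattern from the first query can be applied to the shape of the query in the question. The following is only a minimal sketch: the inline tables x and y and the columns k and something are hypothetical stand-ins for the asker's X, Y, KEY and SOMETHING, and no timing is claimed for it.

```sql
-- Hypothetical inline data standing in for the asker's tables X and Y.
WITH x AS (
  SELECT 'a' AS k, 'keep me' AS something UNION ALL
  SELECT 'b', 'drop me'
),
y AS (
  SELECT 'b' AS k
)
-- Anti-join: keep only the rows of x that found no match in y.
SELECT x.something
FROM x
LEFT JOIN y ON x.k = y.k
WHERE y.k IS NULL  -- unmatched rows have NULL in every y column
```

With this sample data only the row with k = 'a' survives. Note that this form matches NOT EXISTS rather than NOT IN when y contains NULL keys.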

Generally speaking, NOT EXISTS is preferable to NOT IN because it behaves more predictably when the subquery returns NULL values.
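
To illustrate the NULL behavior, here is a small sketch in standard SQL. The inline tables x and y and the column k are made up for the example. Because a comparison with NULL evaluates to NULL rather than TRUE, a single NULL in the subquery makes NOT IN reject every row, while NOT EXISTS ignores it.

```sql
-- Hypothetical inline data; y contains one NULL key.
WITH x AS (
  SELECT 1 AS k UNION ALL
  SELECT 2 UNION ALL
  SELECT 3
),
y AS (
  SELECT 1 AS k UNION ALL
  SELECT CAST(NULL AS INT64)
)
SELECT
  -- NOT IN: k = 1 is FALSE, k = 2 and k = 3 are NULL, so no row passes: 0
  (SELECT COUNT(*) FROM x WHERE k NOT IN (SELECT k FROM y)) AS not_in_rows,
  -- NOT EXISTS: the NULL key never matches, so k = 2 and k = 3 pass: 2
  (SELECT COUNT(*) FROM x
   WHERE NOT EXISTS (SELECT 1 FROM y WHERE y.k = x.k)) AS not_exists_rows
```

If the key column in the subquery can never be NULL, both forms return the same rows.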

Upvotes: 1
