xivzgrev
xivzgrev

Reputation: 101

In BigQuery SQL, what's generally the most efficient way to join & filter a second table?

In BigQuery, what's the most efficient way (from performance standpoint) to complete the following task?

Objective: Select rows from table_a that match with US-based rows in table_b.

I see at least three different ways to go about this task.

1) Using a subquery to filter

SELECT * FROM table_a
JOIN (select * from table_b where country='US') table_b
ON table_a.userid = table_b.userid

2) Using join clauses to filter

SELECT * FROM table_a
JOIN table_b
ON table_a.userid = table_b.userid
  AND table_b.country='US'

3) Adding where clause at end

SELECT * FROM table_a
JOIN table_b
ON table_a.userid = table_b.userid 
WHERE table_b.country='US'

Upvotes: 1

Views: 6587

Answers (3)

t.Chovatia
t.Chovatia

Reputation: 31

The query with a subquery will be the slowest, then the 3rd query and the most optimised one is the 2nd one as it only uses join whereas the 3rd query uses both join and a WHERE clause.

You can refer to these stackoverflow posts, to get a deeper understanding. Difference between joins and subquery : Join vs. sub-query Difference between joins and where clause : Is a JOIN faster than a WHERE?

Hope this helps !

Upvotes: 0

Mikhail Berlyant
Mikhail Berlyant

Reputation: 172994

In your case - I would go with below to narrow down JOIN volume as early as possible. Even though BigQuery Engine is smart enought to come up with this optimization by its own for you no matter which version (of query out of three in question) you use

SELECT * 
FROM table_a 
JOIN (SELECT * FROM table_b WHERE country='US') table_b 
ON table_a.userid = table_b.userid  

but if you need only rows from table_a - I would go with something like

SELECT a.* 
FROM table_a a
JOIN (SELECT DISTINCT userid FROM table_b WHERE country='US') table_b 
ON a.userid = table_b.userid  

Upvotes: 0

Elliott Brossard
Elliott Brossard

Reputation: 33705

If you use standard SQL for these queries, then they will all execute in the same way, which you can confirm by looking at the query plan explanation after executing them. Specifically, BigQuery applies the following transformations:

  • BigQuery pushes filters through INNER JOINs, so putting the WHERE in a subquery prior to the join is analogous to having it appear outside the join.
  • BigQuery extracts uncorrelated filters from the ON clause, so ON table_b.country='US' is treated the same as WHERE table_b.country='US'.

If you use legacy SQL instead, however, then you need to move the condition inside the subquery prior to the join, since BigQuery does not "push down" filters through joins when using legacy SQL.

Upvotes: 5

Related Questions