user3206440
user3206440

Reputation: 5049

comparing usage of inner join and where in

with two tables - all_data and selected_place_day_hours

all_data has place_id, day, hour, metric

selected_place_day_hours has fields place_id, day, hour

I need to subset all_data such that only records with place_id, day, hour in selected_place_day_hours are selected.

I can go two ways about it

1.Use inner join

select a.*
from all_data as a
inner join selected_place_day_hours as b
on (a.place_id = b.place_id)
and ( a.day = b.day)
and ( a.hour = b.hour)
;

2.Use where in

select *
 from all_data
 where 
   place_id in (select place_id from selected_place_day_hours)
   and day in (select day from selected_place_day_hours)
   and hour in (select day from selected_place_day_hours)
;

I want to get some idea on why, when, if you would choose one over the other from a functional and performance perspective ?

One thought is that in #2 above, probably sub-selects is not performance friendly and also longer code.

Upvotes: 0

Views: 48

Answers (3)

Manngo
Manngo

Reputation: 16301

These days, SQL tends to ignore what you say and do its own thing.

This is why SQL is a declarative language, not a programming language: you tell it what you want, not how to do it. The SQL interpreter will work out what you want and devise its own plan for how to get the results.

In this case, the 2 versions will probably produce an identical plan, regardless of how you write it. In any case, the plan chosen will be the most efficient one.

The reasons to prefer the join syntax over the older where syntax are:

  • to look cool: you don’t want anybody catching you with code that is old-fashioned
  • the join syntax is easy to adapt to outer joins
  • the join syntax allows you to separate the join part from additional filter by distinguishing between join and where

The reasons do not include whether one is better, because the interpreter will handle that.

Upvotes: 1

Pham X. Bach
Pham X. Bach

Reputation: 5442

These are some more notes that are too long for a comment.

First it should be showed that your two queries is different. (Maybe the 2nd query you wrote is a wrong query)

For example:

all_data

place_id    day     hour    other_cols...
1           4       3       ....

selected_place_day_hours

place_id    day     hour
1           4       9
4444        4444    6 

Then your 1st query will get no row in return, and your 2nd will return (1, 4, 6)

One more note is that if (place_id, day, hour) is unique, your first query is in same purpose of following query

SELECT *
FROM all_data
WHERE 
    (place_id, day, hour) IN (
        SELECT place_id, day, hour
        FROM selected_place_day_hours
    );

Upvotes: 0

Gordon Linoff
Gordon Linoff

Reputation: 1269873

The two are semantically different.

The IN does a semi-join, meaning that it returns one from all_data regardless of how many rows are matched in selected_place_day_hours.

The JOIN can return multiple rows.

So, the first piece of advice is to use the version that is correct for what you want to accomplish.

Assuming the data in select_place_day_hours guarantees at most one match, then you have an issue with performance. The first piece of advice is to try both queries on your data and on your system. However, often JOIN is optimized at least as well as IN, so that would usually be a safe choice.

Upvotes: 1

Related Questions