Reputation: 5049
with two tables - all_data
and selected_place_day_hours
all_data
has place_id
, day
, hour
, metric
selected_place_day_hours
has fields place_id
, day
, hour
I need to subset all_data
such that only records with place_id
, day
, hour
in selected_place_day_hours
are selected.
I can go two ways about it
1.Use inner join
select a.*
from all_data as a
inner join selected_place_day_hours as b
on (a.place_id = b.place_id)
and ( a.day = b.day)
and ( a.hour = b.hour)
;
2.Use where in
select *
from all_data
where
place_id in (select place_id from selected_place_day_hours)
and day in (select day from selected_place_day_hours)
and hour in (select day from selected_place_day_hours)
;
I want to get some idea on why, when, if you would choose one over the other from a functional and performance perspective ?
One thought is that in #2 above, probably sub-selects is not performance friendly and also longer code.
Upvotes: 0
Views: 48
Reputation: 16301
These days, SQL tends to ignore what you say and do its own thing.
This is why SQL is a declarative language, not a programming language: you tell it what you want, not how to do it. The SQL interpreter will work out what you want and devise its own plan for how to get the results.
In this case, the 2 versions will probably produce an identical plan, regardless of how you write it. In any case, the plan chosen will be the most efficient one.
The reasons to prefer the join
syntax over the older where
syntax are:
join
syntax is easy to adapt to outer joinsjoin
syntax allows you to separate the join part from additional filter by distinguishing between join
and where
The reasons do not include whether one is better, because the interpreter will handle that.
Upvotes: 1
Reputation: 5442
These are some more notes that are too long for a comment.
First it should be showed that your two queries is different. (Maybe the 2nd query you wrote is a wrong query)
For example:
all_data
place_id day hour other_cols...
1 4 3 ....
selected_place_day_hours
place_id day hour
1 4 9
4444 4444 6
Then your 1st query will get no row in return, and your 2nd will return (1, 4, 6)
One more note is that if (place_id, day, hour)
is unique, your first query is in same purpose of following query
SELECT *
FROM all_data
WHERE
(place_id, day, hour) IN (
SELECT place_id, day, hour
FROM selected_place_day_hours
);
Upvotes: 0
Reputation: 1269873
The two are semantically different.
The IN
does a semi-join, meaning that it returns one from all_data
regardless of how many rows are matched in selected_place_day_hours
.
The JOIN
can return multiple rows.
So, the first piece of advice is to use the version that is correct for what you want to accomplish.
Assuming the data in select_place_day_hours
guarantees at most one match, then you have an issue with performance. The first piece of advice is to try both queries on your data and on your system. However, often JOIN
is optimized at least as well as IN
, so that would usually be a safe choice.
Upvotes: 1