Reputation: 314
I have two tables:
Table A:
Table B:
I want to get all the rows in B where either B.id1 or B.id2 is A.id (all A.id are distinct. This isn't the case for B).
I've tried the query:
create table C as
select B.*
from B
join A on (A.id = B.id1 or A.id = B.id2);
This works great except it duplicates rows where id1 and id2 are both in id. A is a relatively small table (around 5000 rows) whereas B is quite large (around 1 billion rows).
Solutions that I've thought of involve grouping by the columns of B so it eliminates distinct rows, collect the ids of A into an array and searching for id1 and id2 in the array, and using 2 separate queries which search for id1 and id2 respectively and then union the two tables. These all seem quite inefficient especially considering the size of the tables. Is there a better way of doing this that I'm missing? Thanks.
Upvotes: 0
Views: 4447
Reputation: 1269483
I would use exists
:
select b.*
from b
where exists (select 1 from a where a.id = b.id1) or
exists (select 1 from a where a.id = b.id2);
In most databases, this would be the most efficient method for this type of logic. I'm not 100% sure that this is true in Hive, but it is definitely worth a try.
An alternative approach would be left join
s:
select b.*
from b left join
a a1
on b.id1 = a1.id left join
a a2
on b.id2 = a2.id
where a1.id is not null or a2.id is not null;
This might have better performance in Hive, if the exists
does not have good optimization.
Upvotes: 3