Reputation: 28
I have two tables. Table 1
has all the unique places (30 rows) that I am interested in:
places |
---|
japan |
china |
india |
... |
Table 2
has all the information for IDs, places visited and date.
id | places | date |
---|---|---|
10001 | japan | 20210204 |
10001 | australia | 20210204 |
10001 | china | 20210204 |
10001 | argentina | 20210205 |
10002 | spain | 20210204 |
10002 | india | 20210204 |
10002 | china | 20210205 |
10003 | argentina | 20210204 |
10003 | portugal | 20210204 |
What I am interested to get is:
IDs
from Table 2 that has visited at least one of the places
from Table 1IDs
to a temp table.Here is what I have tried:
create temporary table imp.unique_ids_tmp
as select distinct(final.id) from
(select t2.id
from table2 as t2
where t2.date = '20210204'
and t2.places in
(select * from table1)) final;
I am struggling to incorporate the "at least one" logic such that once a satisfying id
is found, it stops looking at those id
records.
Upvotes: 1
Views: 538
Reputation: 38325
Use left semi join (implements uncorrelated EXISTS in efficient way), it will filter only records joined, after that apply distinct:
create temporary table imp.unique_ids_tmp as
select distinct t2.id --distinct is not a function, do not need ()
from table2 t2
left semi join table1 t1 on t2.places = t1.places
where t2.date = '20210204'
;
At "least once" condition will be satisfied: IDs which have no joined records will not present in the dataset.
Another way is to use correlated EXISTS:
create temporary table imp.unique_ids_tmp as
select distinct t2.id --distinct is not a function, do not need ()
from table2 t2
where t2.date = '20210204'
--this condition is true as soon as one match is found
and exists (select 1 from table1 t1 where t2.places = t1.places)
;
IN also will work.
Correllated EXIST looks close to "once a satisfying id is found, it stops looking at those id records", but all these methods are implemented using JOIN in Hive. Execute EXPLAIN and you will see, it will be the same plan generated, though it depends on implementation in your version. Potentially EXISTS can be faster because do not need to check all records in the subquery. Taking into account that your table1 with 30 rows is small enough to fit in memory, MAP-JOIN (set hive.auto.convert.join=true;
) will give you the best performance.
One more fast method using array or IN(static_list). It can be used for small and static arrays. Ordered array may give you better performance:
select distinct t2.id --distinct is not a function, do not need ()
from table2 t2
where t2.date = '20210204'
and array_contains(array('australia', 'china', 'japan', ... ), t2.places)
--OR use t2.places IN ('australia', 'china', 'japan', ... )
Why this method is faster: Because no need to start mapper and calculate splits to read table from hdfs, only table2 will be read. The drawback is that list of values is static. On the other hand, you can pass whole list as a parameter, see here.
Upvotes: 1