Hive: Find unique values if condition met between two tables

Question

I have two tables. Table 1 has all the unique places (30 rows) that I am interested in:

places
japan
china
india
...

Table 2 has all the information for IDs, places visited and date.

id	places	date
10001	japan	20210204
10001	australia	20210204
10001	china	20210204
10001	argentina	20210205
10002	spain	20210204
10002	india	20210204
10002	china	20210205
10003	argentina	20210204
10003	portugal	20210204

What I am interested to get is:

For a specific date (let's say 20210204)
Find all unique IDs from Table 2 that has visited at least one of the places from Table 1
Save those unique IDs to a temp table.

Here is what I have tried:

create temporary table imp.unique_ids_tmp
as select distinct(final.id) from
(select t2.id
from table2 as t2
where t2.date = '20210204'
and t2.places in 
(select * from table1)) final;

I am struggling to incorporate the "at least one" logic such that once a satisfying id is found, it stops looking at those id records.

leftjoin · Accepted Answer

Use left semi join (implements uncorrelated EXISTS in efficient way), it will filter only records joined, after that apply distinct:

create temporary table imp.unique_ids_tmp as
select distinct t2.id --distinct is not a function, do not need ()
  from table2 t2
       left semi join table1 t1 on t2.places = t1.places
 where t2.date = '20210204'
;

At "least once" condition will be satisfied: IDs which have no joined records will not present in the dataset.

Another way is to use correlated EXISTS:

create temporary table imp.unique_ids_tmp as
select distinct t2.id --distinct is not a function, do not need ()
  from table2 t2
 where t2.date = '20210204' 
   --this condition is true as soon as one match is found
   and exists (select 1 from table1 t1 where t2.places = t1.places)
;

IN also will work.

Correllated EXIST looks close to "once a satisfying id is found, it stops looking at those id records", but all these methods are implemented using JOIN in Hive. Execute EXPLAIN and you will see, it will be the same plan generated, though it depends on implementation in your version. Potentially EXISTS can be faster because do not need to check all records in the subquery. Taking into account that your table1 with 30 rows is small enough to fit in memory, MAP-JOIN (set hive.auto.convert.join=true;) will give you the best performance.

One more fast method using array or IN(static_list). It can be used for small and static arrays. Ordered array may give you better performance:

select distinct t2.id --distinct is not a function, do not need ()
  from table2 t2
 where t2.date = '20210204'
       and array_contains(array('australia', 'china', 'japan', ... ), t2.places)
       --OR use t2.places IN ('australia', 'china', 'japan', ... )

Why this method is faster: Because no need to start mapper and calculate splits to read table from hdfs, only table2 will be read. The drawback is that list of values is static. On the other hand, you can pass whole list as a parameter, see here.

Hive: Find unique values if condition met between two tables

Answers (1)

Related Questions