Reputation: 1056
I have 3 very large tables* of IP addresses and am trying to count the number of common IPs between the 3 tables. I have considered using joins and also a subquery to find the intersection of IPs between these 3 tables. How can I find the intersection of all 3 tables with one query?
This is improper syntax, but illustrates what I'm trying to accomplish:
SELECT COUNT(DISTINCT(a.ip)) FROM a, b, c WHERE a.ip = b.ip = c.ip
I have seen other answers about how to join 3 tables, but nothing for Hive and nothing for this scale.
* Notes:
Upvotes: 0
Views: 1112
Reputation: 1270201
The correct syntax is:
SELECT COUNT(DISTINCT a.ip)
FROM a JOIN
b
ON a.ip = b.ip JOIN
c
ON a.ip = c.ip;
That probably will not finish in our lifetimes. A better approach is:
select ip
from (select distinct a.ip, 1 as which from a union all
select distinct b.ip, 2 as which from b union all
select distinct c.ip, 3 as which from c
) abc
group by ip
having sum(which) = 6;
Admitted, the sum(which) = 6
is just saying that all three exist. Because of the select distinct
in the subqueries, you can just do:
having count(*) = 3
Upvotes: 3
Reputation: 44951
A straightforward solution:
select count(*)
from (select 1
from (
select 'a' as tab,ip from a
union all select 'b' as tab,ip from b
union all select 'c' as tab,ip from c
) t
group by ip
having count(case when tab = 'a' then 1 end) > 0
and count(case when tab = 'b' then 1 end) > 0
and count(case when tab = 'c' then 1 end) > 0
) t
This will give you information not only about the 3 tables intersection (in_a=1,in_b=1,in_c=1) but also information on all other combinations:
select in_a
,in_b
,in_c
,count(*) as ips
from (select max(case when tab = 'a' then 1 end) as in_a
,max(case when tab = 'b' then 1 end) as in_b
,max(case when tab = 'c' then 1 end) as in_c
from (
select 'a' as tab,ip from a
union all select 'b' as tab,ip from b
union all select 'c' as tab,ip from c
) t
group by ip
) t
group by in_a
,in_b
,in_c
... and even some more information:
select sign(cnt_a) as in_a
,sign(cnt_b) as in_b
,sign(cnt_c) as in_c
,count(*) as unique_ips
,sum(cnt_total) as total_ips
,sum(cnt_a) as total_ips_in_a
,sum(cnt_b) as total_ips_in_b
,sum(cnt_c) as total_ips_in_c
from (select count(*) as cnt_total
,count(case when tab = 'a' then 1 end) as cnt_a
,count(case when tab = 'b' then 1 end) as cnt_b
,count(case when tab = 'c' then 1 end) as cnt_c
from (
select 'a' as tab,ip from a
union all select 'b' as tab,ip from b
union all select 'c' as tab,ip from c
) t
group by ip
) t
group by sign(cnt_a)
,sign(cnt_b)
,sign(cnt_c)
Upvotes: 1