TheProletariat
TheProletariat

Reputation: 1056

How can I query 3 large tables for Intersecting values using Hive?

I have 3 very large tables* of IP addresses and am trying to count the number of common IPs between the 3 tables. I have considered using joins and also a subquery to find the intersection of IPs between these 3 tables. How can I find the intersection of all 3 tables with one query?

This is improper syntax, but illustrates what I'm trying to accomplish:

SELECT COUNT(DISTINCT(a.ip)) FROM a, b, c WHERE a.ip = b.ip = c.ip

I have seen other answers about how to join 3 tables, but nothing for Hive and nothing for this scale.

* Notes:

Upvotes: 0

Views: 1112

Answers (2)

Gordon Linoff
Gordon Linoff

Reputation: 1270201

The correct syntax is:

SELECT COUNT(DISTINCT a.ip)
FROM a JOIN
     b
     ON a.ip = b.ip JOIN
     c
     ON a.ip  = c.ip;

That probably will not finish in our lifetimes. A better approach is:

select ip
from (select distinct a.ip, 1 as which from a union all
      select distinct b.ip, 2 as which from b union all
      select distinct c.ip, 3 as which from c
     ) abc
group by ip
having sum(which) = 6;

Admitted, the sum(which) = 6 is just saying that all three exist. Because of the select distinct in the subqueries, you can just do:

having count(*) = 3

Upvotes: 3

David דודו Markovitz
David דודו Markovitz

Reputation: 44951

A straightforward solution:

select      count(*)

from       (select      1

            from        (
                                    select 'a' as tab,ip from a
                        union all   select 'b' as tab,ip from b
                        union all   select 'c' as tab,ip from c
                        ) t

            group by    ip

            having      count(case when tab = 'a' then 1 end) > 0
                    and count(case when tab = 'b' then 1 end) > 0
                    and count(case when tab = 'c' then 1 end) > 0

            ) t

This will give you information not only about the 3 tables intersection (in_a=1,in_b=1,in_c=1) but also information on all other combinations:

select      in_a
           ,in_b
           ,in_c
           ,count(*)    as ips

from       (select      max(case when tab = 'a' then 1 end)  as in_a
                       ,max(case when tab = 'b' then 1 end)  as in_b
                       ,max(case when tab = 'c' then 1 end)  as in_c

            from        (
                                    select 'a' as tab,ip from a
                        union all   select 'b' as tab,ip from b
                        union all   select 'c' as tab,ip from c
                        ) t

            group by    ip
            ) t

group by    in_a
           ,in_b
           ,in_c

... and even some more information:

select      sign(cnt_a)                 as in_a
           ,sign(cnt_b)                 as in_b
           ,sign(cnt_c)                 as in_c

           ,count(*)                    as unique_ips
           ,sum(cnt_total)              as total_ips
           ,sum(cnt_a)                  as total_ips_in_a
           ,sum(cnt_b)                  as total_ips_in_b
           ,sum(cnt_c)                  as total_ips_in_c

from       (select      count(*)                                as cnt_total
                       ,count(case when tab = 'a' then 1 end)   as cnt_a
                       ,count(case when tab = 'b' then 1 end)   as cnt_b
                       ,count(case when tab = 'c' then 1 end)   as cnt_c

            from        (
                                    select 'a' as tab,ip from a
                        union all   select 'b' as tab,ip from b
                        union all   select 'c' as tab,ip from c
                        ) t

            group by    ip
            ) t

group by    sign(cnt_a)
           ,sign(cnt_b)
           ,sign(cnt_c)

Upvotes: 1

Related Questions