sds
sds

Reputation: 60074

Set intersection/difference in SQL?

I have two tables t1 and t2, each defining a multimap from id to word:

> select * from t1;
id word
1  foo
1  bar
2  baz
2  quux

and

> select * from t2;
id word
1  foo
1  baz
3  baz

What I want is to find out the sizes of the union and intersection of the word sets for each id:

id  t1_union_t2 t1 t2 t2_minus_t1  t1_minus_t2 t1_intersect_t2
1   3           2  2  1            1           1
2   2           2  0  0            2           0
3   1           0  1  1            0           0

Obviously, the columns are not independent, e.g.,

t1_union_t2 = t1 + t2_minus_t1 = t1 + t2 - t1_intersect_t2

I want all of them for consistency checking only.

Upvotes: 0

Views: 85

Answers (2)

peterm
peterm

Reputation: 92845

One way to do it with FULL JOIN

SELECT COALESCE(t1.id, t2.id) id,
       COUNT(*) t1_union_t2,
       COUNT(t1.id) t1,
       COUNT(t2.id) t2,
       SUM(CASE WHEN t1.id IS NULL THEN 1 ELSE 0 END) t2_minus_t1,
       SUM(CASE WHEN t2.id IS NULL THEN 1 ELSE 0 END) t1_minus_t2,
       SUM(CASE WHEN t1.id = t2.id THEN 1 ELSE 0 END) t1_intersect_t2
  FROM t1 FULL JOIN t2
    ON t1.id = t2.id
   AND t1.word = t2.word
 GROUP BY COALESCE(t1.id, t2.id);

Output:

| ID | T1_UNION_T2 | T1 | T2 | T2_MINUS_T1 | T1_MINUS_T2 | T1_INTERSECT_T2 |
|----|-------------|----|----|-------------|-------------|-----------------|
|  1 |           3 |  2 |  2 |           1 |           1 |               1 |
|  2 |           2 |  2 |  0 |           0 |           2 |               0 |
|  3 |           1 |  0 |  1 |           1 |           0 |               0 |

Here is a SQLFiddle demo

Upvotes: 1

Gordon Linoff
Gordon Linoff

Reputation: 1271151

Here is how I approach this in SQL:

select numtable1, numtable2, count(*) as numwords, min(id) as minid, max(id) as maxid
from (select id, word, sum(istable1) as numtable1, sum(istable2) as numtable2
      from ((select id, word, 1 as istable1, 0 as istable2
             from table1
            ) union all
            (select id, word, 0 as istable1, 1 as istable2
             from table2
            )
           ) t
      group by id, word
     ) t
group by numtable1, numtable2;

This identifies duplicates within each table as well as between them.

Hive supports subqueries in the from clause, so this might also work in Hive.

Upvotes: 2

Related Questions