geowav
geowav

Reputation: 103

HiveQL: how to remove duplicate rows based on two columns

I am creating a table of undirected graph like below.

+-------------------+------------------------+----------------------+
|     id            |     node_a             |        node_b        |
+-------------------+------------------------+----------------------+
|     1             |         a              |           b          |
+-------------------+------------------------+----------------------+
|     2             |         a              |           c          |
+-------------------+------------------------+----------------------+
|     3             |         a              |           d          |
+-------------------+------------------------+----------------------+
|     4             |         b              |           a          |
+-------------------+------------------------+----------------------+
|     5             |         b              |           c          |
+-------------------+------------------------+----------------------+
...

row id=1 and id=4 are duplicate rows and one shall be deleted. What would be an efficient way to remove all duplicate rows in this table?

Upvotes: 1

Views: 163

Answers (1)

Gordon Linoff
Gordon Linoff

Reputation: 1269483

You can generate the distinct rows by doing:

select e.*
from edges e
where e.node_a < e.node_b
union all
select e.*
from edges e
where e.node_a > e.node_b and
      not exists (select 1
                  from edges e2 
                  where e2.node_a = e.node_b and e2.node_b = e.node_a
                 );

If you actually have duplicates that are not transposed, using union instead of union all.

The above preserves the original edges in the table. If that is not a concern, a simple method is:

select distinct least(node_a, node_b) as node_a, greatest(node_a, node_b) as node_b
from edges e;

Upvotes: 2

Related Questions