Reputation: 41

Finding duplicates in Pig

If I have a table with duplicate rows for an id, I can find them in Hive with the following query:

create table dupe as select * from table1 group by id having count(*) > 1;

Can the same thing be done in Pig?

If so, could someone please show me how?

Upvotes: 0

Answers (1)

zsxwing

Reputation: 20826

The following code may help:

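r1 = load ...;
r2 = group r1 by id;                              -- one bag of rows per id
r3 = foreach r2 generate COUNT(r1) as c, r1;      -- count the rows in each bag
r4 = filter r3 by c > 1;                          -- keep ids that occur more than once
r5 = foreach r4 generate FLATTEN(r1);             -- expand the bags back into rows
dump r5;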

However, the original row order is not preserved.
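
To make the data flow of the script above easier to follow, here is a minimal sketch of the same group / count / filter / flatten steps in plain Python (not Pig). The function name `find_duplicate_rows`, the `key_index` parameter, and the sample rows are all hypothetical and only serve as illustration; it assumes each row is a tuple with the id in the first position.

```python
from collections import defaultdict


def find_duplicate_rows(rows, key_index=0):
    """Return every row whose key occurs more than once (hypothetical helper)."""
    groups = defaultdict(list)
    for row in rows:
        # Roughly what "group r1 by id" does: collect rows per id.
        groups[row[key_index]].append(row)

    duplicates = []
    for members in groups.values():
        # Roughly "COUNT(r1) as c" followed by "filter ... by c > 1".
        if len(members) > 1:
            # Roughly "FLATTEN(r1)": emit the grouped rows one by one.
            duplicates.extend(members)
    return duplicates


# Made-up sample data: (id, name)
rows = [(1, "a"), (2, "b"), (1, "c"), (3, "d"), (2, "e")]

# Sorted before printing so the result does not depend on group order.
print(sorted(find_duplicate_rows(rows)))
# prints [(1, 'a'), (1, 'c'), (2, 'b'), (2, 'e')]
```

As in the Pig version, every copy of a duplicated id is returned, not just one representative per id, and the result should not be relied on to follow the input order.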

Upvotes: 5
