Reputation: 1757
I want to remove the duplicate rows in pig. There are a lot of ways, but I am not sure if which one is better.
Here is the data set, the schema is (f0,f1,id,f3,f4):
1,2,3,2015-02-21,2015-02-20
1,2,3,2015-02-22,2015-02-20
1,2,3,2015-02-23,2015-02-20
1,2,4,2015-02-24,2015-02-20
1,2,5,2015-02-25,2015-02-20
If any of rows whose f0,f1 and id are equal, then they are considered to be the duplicate. And I want to output one of them where f3 is minimum.
But I also want to output which ids have the duplicates.
That is, I will store or dump two relations.
one of both relations are:
1,2,3,2015-02-21,2015-02-20
1,2,4,2015-02-24,2015-02-20
1,2,5,2015-02-25,2015-02-20
The other one is the id which has the duplicate rows, the schema is (id,f4)
3,2015-02-20
That is, id=3 has the duplicate data.
Here is my workaround
r1 = LOAD 'data' USING PigStorage(',');
r2 = group r1 by ($0,$1,$2);
r3 = FOREACH r2 GENERATE COUNT(r1) as c, r1;
SPLIT r3 into r4 if c > 1, r5 if c == 1;
r6 = FOREACH r5 GENERATE flatten(r1);
dups_id = FOREACH r4 {
GENERATE flatten(r1.$2),flatten(r1.$4);
};
r7= distinct dups_id
dump r7
no_dups = FOREACH r4 {
sorted = ORDER r1 by $3 ASC;
lim = limit sorted 1;
GENERATE flatten(lim);
};
r8 = union no_dups,r6
dump r8
I think that this is a little complicated, and I doubt the performance.
Is there any other better idea can implement this use case?
Upvotes: 0
Views: 2799
Reputation: 371
Here is how I would do it.
r1 = LOAD 'data' USING PigStorage(',');
r2 = group r1 by ($0,$1,$2);
r3 = FOREACH r2 GENERATE $0.., SIZE($1) AS size;
DEFINE MYTOP TOP('ASC');
r8 = FOREACH r2 {
GENERATE MYTOP(1, 3, r1);
};
dups = FILTER r3 BY size > 1L;
dups2 = FOREACH dups GENERATE FLATTEN($1);
dups3 = FOREACH dups2 GENERATE $2, $4;
dups_id = DISTINCT dups3;
dump r8;
dump dups_id;
Upvotes: 1