Anil Savaliya

Reputation: 129

Removing duplicates using PigLatin and retaining the last element

I am using Pig Latin. I want to remove duplicates from the bags and retain only the last element for each key.

Input:
User1  7 LA 
User1  8 NYC 
User1  9 NYC 
User2  3 NYC
User2  4 DC 


Output:
User1  9 NYC 
User2  4 DC 

Here the first field is the key, and I want the last record for each key to be retained in the output.

I know how to retain the first element, as shown below, but I am not able to retain the last element.

inpt = load '......' ......;
user_grp = GROUP inpt BY $0;
filtered = FOREACH user_grp {
      top_rec = LIMIT inpt 1;
      GENERATE FLATTEN(top_rec);
};

Can anybody help me with this? Thanks in advance!

Upvotes: 1

Views: 1954

Answers (2)

Murali Rao

Reputation: 2287

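@Anil: If you order each group by one of the fields in descending order and then take the first record, you get the last record for that key. In the code below, I have ordered by the second field of the input (named `no` in the script). This assumes the "last" record is the one with the highest value in that field.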

Input :

User1,7,LA 
User1,8,NYC 
User1,9,NYC 
User2,3,NYC
User2,4,DC

Pig snippet :

user_details = LOAD 'user_details.csv'  USING  PigStorage(',') AS (user_name:chararray,no:long,city:chararray);

user_details_grp_user = GROUP user_details BY user_name;

required_user_details = FOREACH user_details_grp_user {
    user_details_sorted_by_no = ORDER user_details BY no DESC;
    top_record = LIMIT user_details_sorted_by_no 1;
    GENERATE FLATTEN(top_record);
};

Output : DUMP required_user_details

(User1,9,NYC )
(User2,4,DC)
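
For readers who want to check the logic without a Pig installation, here is a minimal plain-Python sketch of what the nested ORDER ... DESC followed by LIMIT 1 does within each group. It is an emulation, not Pig code: the function name `last_per_key` and the in-memory `rows` list are assumptions made for illustration, and the trailing spaces in the sample input are dropped.

```python
from itertools import groupby
from operator import itemgetter

def last_per_key(rows):
    """Return, for each user_name, the row with the largest `no`."""
    # GROUP ... BY user_name: groupby needs its input sorted by the key.
    by_user = sorted(rows, key=itemgetter(0))
    result = []
    for _, group in groupby(by_user, key=itemgetter(0)):
        # ORDER ... BY no DESC followed by LIMIT 1 is the max by `no`.
        result.append(max(group, key=itemgetter(1)))
    return result

rows = [
    ("User1", 7, "LA"),
    ("User1", 8, "NYC"),
    ("User1", 9, "NYC"),
    ("User2", 3, "NYC"),
    ("User2", 4, "DC"),
]
print(last_per_key(rows))
```

Note that this picks the record with the largest `no`, which matches the last record in the file only when the input is already ordered by that field.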

Upvotes: 5

Surender Raja

Reputation: 3599

You can use the RANK operator.

I hope the code below helps.

rec = LOAD '/user/cloudera/inputfiles/sample.txt' USING PigStorage(',') AS (user:chararray, no:int, loc:chararray);
-- RANK with no BY clause prepends a sequential number to each tuple.
rec_rank = RANK rec;
rec_rank_each = FOREACH rec_rank GENERATE $0 AS rank_key, user, no, loc;
rec_rank_grp = GROUP rec_rank_each BY user;
-- Highest rank per user, i.e. the position of that user's last record.
rec_rank_max = FOREACH rec_rank_grp GENERATE group AS temp_user, MAX(rec_rank_each.rank_key) AS max_rank;
-- Keep only the rows whose rank equals the per-user maximum.
rec_join = JOIN rec_rank_each BY (user, rank_key), rec_rank_max BY (temp_user, max_rank);
rec_output = FOREACH rec_join GENERATE user, no, loc;
DUMP rec_output;

Make sure you run this on Pig 0.11 or later, since the RANK operator was introduced in Pig 0.11.
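
The rank-then-join idea can be sketched in plain Python to see what it selects. This is only an emulation under the assumption that the sequential rank follows the order in which rows are read; the function name `last_by_position` and the in-memory `rows` list are hypothetical and not part of the Pig script.

```python
def last_by_position(rows):
    """Keep, for each user, the row that appears last in the input."""
    # RANK with no BY clause: prepend a sequential number to each row.
    ranked = [(rank, *row) for rank, row in enumerate(rows, start=1)]
    # GROUP BY user + MAX(rank_key): highest rank seen per user.
    max_rank = {}
    for rank, user, _no, _loc in ranked:
        max_rank[user] = max(rank, max_rank.get(user, 0))
    # JOIN on (user, rank_key): keep only rows carrying that max rank.
    return [
        (user, no, loc)
        for rank, user, no, loc in ranked
        if max_rank[user] == rank
    ]

rows = [
    ("User1", 7, "LA"),
    ("User1", 8, "NYC"),
    ("User1", 9, "NYC"),
    ("User2", 3, "NYC"),
    ("User2", 4, "DC"),
]
print(last_by_position(rows))
```

Unlike ordering by the second field, this selects by position, so it still returns the final row for a key even when the values in that row are not the largest.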

Upvotes: 0
