Reputation: 1004

Removing duplicate pair in Pig

I have the following sample data:

OBR|1|METABOLIC PANEL
OBX|1|Glucose
OBX|2|BUN
OBX|3|CREATININE
OBR|2|RFLX TO VERIFICATION
OBX|1|EGFR
OBX|2|SODIUM
OBR|3|AMBIGUOUS DEFAULT
OBX|1|POTASSIUM

In this sample, each OBR is one test, and the OBX rows that follow it are the results of that test. Every OBR is followed by an ID (1, 2 and 3 in this case), and the OBX numbering restarts at 1 for each OBR. What I was thinking is: when I find an OBR, I create one unique ID and attach it to every OBX that follows, until I reach the next OBR (ID 2), where I do the same again. Below is my expected output.

Expected result:

OBR|1|METABOLIC PANEL|OBR_filename_1
OBX|1|Glucose|OBR_filename_1
OBX|2|BUN|OBR_filename_1
OBX|3|CREATININE|OBR_filename_1
OBR|2|RFLX TO VERIFICATION|OBR_filename_2
OBX|1|EGFR|OBR_filename_2
OBX|2|SODIUM|OBR_filename_2
OBR|3|AMBIGUOUS DEFAULT|OBR_filename_3
OBX|1|POTASSIUM|OBR_filename_3

Upvotes: 0

Answers (2)

Aandal

Reputation: 51

I tried this; it looks like an HL7 file. You can use Stitch and Over (with its 'lead' function) to come up with something like the script below. There may be a better solution from a performance standpoint, but this should work. Please let me know how it goes.

-- Requires piggybank on the classpath, e.g. REGISTER piggybank.jar; (the jar path is an assumption).
DEFINE Over org.apache.pig.piggybank.evaluation.Over('long');
DEFINE Stitch org.apache.pig.piggybank.evaluation.Stitch;
DEFINE lead org.apache.pig.piggybank.evaluation.Lead;

-- Alias renamed from "in", which newer Pig versions reserve for the IN operator.
raw = LOAD 'hl_file' USING PigStorage('|') AS (id:chararray, num:int, reason:chararray);
-- RANK prepends a sequential row number to each record.
temp = RANK raw;
ranked = FOREACH temp GENERATE $0 AS row_no, $1 AS id:chararray, $2 AS orig_id:int, $3 AS reason:chararray;

-- For each OBR row, find the row number of the next OBR (9999 if there is none).
OBR_data = FILTER ranked BY id == 'OBR';
next_row_num_OBR = FOREACH (GROUP OBR_data BY id) {
    sorted = ORDER OBR_data BY row_no;
    stitched = Stitch(sorted, Over(sorted.row_no, 'lead', 0, 1, 1, (long)9999));
    GENERATE FLATTEN(group) AS (id:chararray),
             FLATTEN(stitched.(row_no, orig_id, reason, result)) AS (row_no:long, orig_id:int, reason:chararray, next_row_no:long);
}

-- Pair every OBX with the OBR whose row range contains it.
-- CROSS builds every OBR/OBX combination, so this can be slow on large inputs.
OBX_data = FILTER ranked BY id == 'OBX';
Crossed = CROSS next_row_num_OBR, OBX_data;
result = FILTER Crossed BY (OBX_data::row_no > next_row_num_OBR::row_no AND OBX_data::row_no < next_row_num_OBR::next_row_no);

This should produce something like this:

(OBR,5,2,RFLX TO VERIFICATION,8,7,OBX,2,SODIUM)
(OBR,1,1,METABOLIC PANEL,5,2,OBX,1,Glucose)
(OBR,5,2,RFLX TO VERIFICATION,8,6,OBX,1,EGFR)
(OBR,8,3,AMBIGUOUS DEFAULT,9999,9,OBX,1,POTASSIUM)
(OBR,1,1,METABOLIC PANEL,5,3,OBX,2,BUN)
(OBR,1,1,METABOLIC PANEL,5,4,OBX,3,CREATININE)

Instead of adding a file name or a constant, this attaches the full OBR record to each of its corresponding OBX rows.
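
To get closer to the format in the question, the joined rows could be projected down to the original fields plus an ID built from the parent OBR number. This is only a sketch that I have not run against real data. It assumes the relations result and OBR_data from the script above. The aliases obx_tagged, obr_tagged, tagged and ordered, the output path hl_file_tagged, and the literal text "filename" inside the ID are all placeholders.

```pig
-- Sketch only: assumes `result` and `OBR_data` exist as defined in the script above.

-- OBX rows: keep the OBX fields and derive the ID from the parent OBR number.
obx_tagged = FOREACH result GENERATE
    OBX_data::row_no  AS row_no,
    OBX_data::id      AS id,
    OBX_data::orig_id AS num,
    OBX_data::reason  AS reason,
    CONCAT('OBR_filename_', (chararray)next_row_num_OBR::orig_id) AS test_id;

-- OBR rows: tag each one with an ID built from its own number.
obr_tagged = FOREACH OBR_data GENERATE
    row_no, id, orig_id AS num, reason,
    CONCAT('OBR_filename_', (chararray)orig_id) AS test_id;

-- Merge both sets and restore the original file order via the row number.
tagged  = UNION obx_tagged, obr_tagged;
ordered = ORDER tagged BY row_no;

-- 'hl_file_tagged' is a hypothetical output path.
-- row_no stays as the first column, because Pig only guarantees the sort
-- order when the ordered relation is stored or dumped directly.
STORE ordered INTO 'hl_file_tagged' USING PigStorage('|');
```

Each stored line would then carry the row number first, followed by the segment type, its number, the description and the derived ID. If the row number is not wanted, it can be dropped when the output is read back.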

Upvotes: 1

nobody

Reputation: 11080

Use DISTINCT. Assuming you have a relation A with duplicate records, the statement below removes the duplicates and stores the unique records in relation B.

B = DISTINCT A;
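
For a fuller picture, here is a minimal sketch of the same statement with a load step around it. The input path pairs.txt and the field names f1 and f2 are made up for illustration.

```pig
-- Hypothetical input file 'pairs.txt' with pipe-delimited lines such as:
--   a|b
--   a|b
--   c|d
A = LOAD 'pairs.txt' USING PigStorage('|') AS (f1:chararray, f2:chararray);

-- DISTINCT compares whole tuples, so the second a|b line is dropped.
B = DISTINCT A;

DUMP B;
```

Because the comparison is on whole tuples and is positional, (a,b) and (b,a) count as different records and both would be kept.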

Upvotes: 1
