user3031097
user3031097

Reputation: 177

Apache PIG - GROUP BY

I am looking to achieve the below functionality in Pig. I have a set of sample records like this.

enter image description here

Note that the EffectiveDate column is sometimes blank and also different for the same CustomerID.

Now, as output, I want one record per CustomerID where the EffectiveDate is the MAX. So, for the above example, i want the records highlighted as shown below.

enter image description here

The way I am doing it currently using PIG is this:

customerdata = LOAD 'customerdata' AS (CustomerID:chararray, CustomerName:chararray, Age:int, Gender:chararray, EffectiveDate:chararray);

--Group customer data by CustomerID
customerdata_grpd = GROUP customerdata BY CustomerID;

--From the grouped data, generate one record per CustomerID that has the maximum EffectiveDate.
customerdata_maxdate = FOREACH customerdata_grpd GENERATE group as CustID, MAX(customerdata.EffectiveDate) as MaxDate;

--Join the above with the original data so that we get the other details like CustomerName, Age etc.
joinwithoriginal = JOIN customerdata by (CustomerID, EffectiveDate), customerdata_maxdate by (CustID, MaxDate);

finaloutput = FOREACH joinwithoriginal GENERATE customerdata::CustomerID as CustomerID, CustomerName as CustomerName, Age as Age, Gender as gender, EffectiveDate as EffectiveDate;

I am basically grouping the original data to find the record with the maximum EffectiveDate. Then I join these 'grouped' records with the Original dataset again to get that same record with Max Effective date, but this time I will also get additional data like CustomerName, Age and Gender. This dataset is huge, so this approach is taking a lot of time. Is there a better approach?

Upvotes: 2

Views: 7629

Answers (1)

Murali Rao
Murali Rao

Reputation: 2287

Input :

1,John,28,M,1-Jan-15
1,John,28,M,1-Feb-15
1,John,28,M,
1,John,28,M,1-Mar-14
2,Jane,25,F,5-Mar-14
2,Jane,25,F,5-Jun-15
2,Jane,25,F,3-Feb-14

Pig Script :

customer_data = LOAD 'customer_data.csv' USING PigStorage(',')  AS  (id:int,name:chararray,age:int,gender:chararray,effective_date:chararray);

customer_data_fmt = FOREACH customer_data GENERATE id..gender,ToDate(effective_date,'dd-MMM-yy') AS date, effective_date;

customer_data_grp_id = GROUP customer_data_fmt BY id;

req_data = FOREACH customer_data_grp_id {
        customer_data_ordered = ORDER customer_data_fmt BY date DESC;
        req_customer_data = LIMIT customer_data_ordered 1;
        GENERATE FLATTEN(req_customer_data.id) AS id, 
                 FLATTEN(req_customer_data.name) AS name,
                 FLATTEN(req_customer_data.gender) AS gender,
                 FLATTEN(req_customer_data.effective_date) AS effective_date;
};  

Output :

(1,John,M,1-Feb-15)
(2,Jane,F,5-Jun-15)

Upvotes: 5

Related Questions