Bramat

Reputation: 997

Concatenate different records in PIG script

I'm trying to write a Pig script that takes a data set containing user id, date, country code, and other attributes. I want to group by user id and date and, for each group, concatenate the country codes into a single field.

For example:

user_id | date       | country_code
1       | 2017-01-01 | US
1       | 2017-01-01 | UK
1       | 2017-01-02 | FR
2       | 2017-01-02 | RU
2       | 2017-01-03 | DE
2       | 2017-01-03 | AU

Desired output:

(1, 2017-01-01, "US,UK")
(1, 2017-01-02, FR)
(2, 2017-01-02, RU)
(2, 2017-01-03, "DE,AU")

Upvotes: 2

Views: 516

Answers (1)

Dennis Jaheruddin

Reputation: 21563

A question with very different wording yielded this answer by @Hari Shankar. As that question does not appear to be a duplicate, I will post the answer here directly:

grouped = GROUP table BY userid;
X = FOREACH grouped GENERATE group as userid,
                             table.clickcount as clicksbag,
                             table.pagenumber as pagenumberbag;

Now X will be:

{(155,{(2),(3),(1)},{(12),(133),(144)}),
 (156,{(6),(7)},{(1),(5)})}

Now you need to use the builtin UDF BagToTuple:

output = FOREACH X GENERATE userid,
                            BagToTuple(clicksbag) as clickcounts,
                            BagToTuple(pagenumberbag) as pagenumbers;

output should now contain what you want. You can also merge this step into the FOREACH over grouped:

output = FOREACH grouped GENERATE group as userid,
                                  BagToTuple(table.clickcount) as clickcounts,
                                  BagToTuple(table.pagenumber) as pagenumbers;
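
Applied to the data in this question, the same pattern might look like the sketch below. It is not taken from the linked answer: the input path 'users.tsv', the tab delimiter, the relation names, and the field name event_date (used instead of date to avoid any clash with Pig's own names) are all assumptions for illustration. It also swaps BagToTuple for the built-in BagToString, which joins the bag's values with a delimiter and so gives one comma-separated string per group rather than a tuple.

```pig
-- Assumed input: tab-separated file with user id, date, and country code.
-- 'users.tsv' and all relation and field names here are hypothetical.
data = LOAD 'users.tsv' USING PigStorage('\t')
       AS (user_id:int, event_date:chararray, country_code:chararray);

-- Group on the composite key (user id, date).
grouped = GROUP data BY (user_id, event_date);

-- FLATTEN turns the group tuple back into two columns;
-- BagToString joins the country codes in each group with a comma.
result = FOREACH grouped GENERATE
             FLATTEN(group) AS (user_id, event_date),
             BagToString(data.country_code, ',') AS country_codes;

DUMP result;
```

Note that Pig does not guarantee the order of tuples inside a bag, so the codes in each string may not appear in input order. If the order matters, sort the bag with a nested ORDER inside the FOREACH before calling BagToString.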

1: http://pig.apache.org/docs/r0.11.1/api/org/apache/pig/builtin/BagToTuple.html

Upvotes: 2
