Evan Zamir
Evan Zamir

Reputation: 8481

Join on multiple fields in Pig

I'm learning Pig and not sure how to do the following. I have on file that stores a series of metadata about chat messages:

12345 13579
23456 24680
19350 20283
28394 20384
10384 29475
.
.
.

The first column is the id of the sender and the second column is the id of the receiver. What I want to do is count how many messages are being sent from men to women, men to men, women to men, and women to women. So I have another file which stores user id's and gender:

12345 M
23456 F
34567 M
45678 M
.
.
.

So the Pig script might start out as follows:

messages = load 'messages.txt' as (from:int, to:int);
users = load 'users.txt' as (id:int,sex:chararray);

From there I'm really not sure what the next step to take should be. I was able to join one column at a time of messages to users, but not sure how to join both columns and then do the subsequent grouping.

Any advice/tips would be super helpful.

Upvotes: 0

Views: 474

Answers (1)

kevad
kevad

Reputation: 419

I guess what you want is to join then group and count your data.

joinedSenderRaw = JOIN users BY id, messages BY from;

joinedSender = FOREACH joinedSenderRaw
    GENERATE messages::from as sender_id,
             users::sex as sender_sex,
             messages::to as receiver_id;

joinedAllRaw = JOIN joinedSender BY receiver_id, users BY id;

joinedAll = FOREACH joinedAllRaw
    GENERATE joinedSender::sender_id,
             joinedSender::sender_sex,
             joinedSender::receiver_id,
             users::sex as receiver_sex;

grouped = GROUP joinedAll BY (sender_sex, receiver_sex);

result = FOREACH grouped
    GENERATE $0.sender_sex AS sender_sex,
             $0.receiver_sex AS receiver_sex,
             COUNT($1) AS your_stat;

I did not test it but something like this should work.

Upvotes: 1

Related Questions