kskp
kskp

Reputation: 712

Pig: Count frequency of multiple columns

I want to calculate the frequency of 2 field-combinations in pig:

------ y1 has the fields -----
a1 = GROUP y1 BY (user_id, tweet_created_at);
a2 = FOREACH a1 GENERATE group AS (user_id, tweet_created_at), COUNT(y1) AS number_of_replies_by_user;
a3 = FOREACH a2 GENERATE user_id, tweet_created_at, number_of_replies_by_user;
a4 = JOIN y1 BY (user_id, tweet_created_at) LEFT OUTER, a3 BY (user_id, tweet_created_at);

In the above, I want to calculate the frequency of the (user_id, tweet_created_at) field combination.

The line a2 = FOREACH a1 GENERATE group AS (user_id, tweet_created_at), COUNT(y1) AS number_of_replies_by_user; gives an error: Incompatable schema: left is "user_id:NULL,tweet_created_at:NULL", right is "group:tuple(user_id:bytearray,tweet_created_at:bytearray)"

I tried without the brackets: a2 = FOREACH a1 GENERATE group AS user_id, tweet_created_at, COUNT(y1) AS number_of_replies_by_user;

I get another error:

Invalid field projection. Projected field [tweet_created_at] does not exist in schema:..................

Is that a syntax error or an issue with my data? What is the right way if it a syntax error?

In short: I want to calculate the number of replies a user gave at the time of every tweet posted. (If he posted 2 tweets on the same day, he might have reply count 10 at the time of first tweet and 15 at the time of second). I guess if I don't group by tweet_created_at, the reply count will be a constant at all times which is wrong.

Upvotes: 0

Views: 686

Answers (1)

nobody
nobody

Reputation: 11090

Use FLATTEN on the group to unnest the tuple into the fields

a2 = FOREACH a1 GENERATE FLATTEN(group) AS (user_id, tweet_created_at), COUNT(y1) AS number_of_replies_by_user;

Upvotes: 2

Related Questions