Reputation: 712
I want to calculate the frequency of 2 field-combinations in pig:
------ y1 has the fields -----
a1 = GROUP y1 BY (user_id, tweet_created_at);
a2 = FOREACH a1 GENERATE group AS (user_id, tweet_created_at), COUNT(y1) AS number_of_replies_by_user;
a3 = FOREACH a2 GENERATE user_id, tweet_created_at, number_of_replies_by_user;
a4 = JOIN y1 BY (user_id, tweet_created_at) LEFT OUTER, a3 BY (user_id, tweet_created_at);
In the above, I want to calculate the frequency of the (user_id, tweet_created_at
) field combination.
The line a2 = FOREACH a1 GENERATE group AS (user_id, tweet_created_at), COUNT(y1) AS number_of_replies_by_user;
gives an error: Incompatable schema: left is "user_id:NULL,tweet_created_at:NULL", right is "group:tuple(user_id:bytearray,tweet_created_at:bytearray)"
I tried without the brackets: a2 = FOREACH a1 GENERATE group AS user_id, tweet_created_at, COUNT(y1) AS number_of_replies_by_user;
I get another error:
Invalid field projection. Projected field [tweet_created_at] does not exist in schema:..................
Is that a syntax error or an issue with my data? What is the right way if it a syntax error?
In short: I want to calculate the number of replies a user gave at the time of every tweet posted. (If he posted 2 tweets on the same day, he might have reply count 10 at the time of first tweet and 15 at the time of second). I guess if I don't group by tweet_created_at
, the reply count will be a constant at all times which is wrong.
Upvotes: 0
Views: 686