IUF

Reputation: 323

How to count and then compute the total average in PIG

Each line in my dataset is a sale, and my goal is to compute the average number of times a client buys over their lifetime.

I have already grouped and counted by clientId like this:

byClientId = GROUP sales BY clientId;
countByClientId = FOREACH byClientId GENERATE group, COUNT($1);

This creates a table with 2 columns: clientId, count of transactions.

Now I am trying to get the overall average of the second column (i.e. the average number of sales per client). I am using this code:

groupCount = GROUP countByClientId all;
avg = foreach groupCount generate AVG($1);

But I get this error message:

[main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1045: 
<line 18, column 31> Could not infer the matching function for org.apache.pig.builtin.AVG 
as multiple or none of them fit. Please use an explicit cast.

How can I get the overall average of the second column?

Upvotes: 0

Views: 437

Answers (1)

AntonyBrd

Reputation: 403

It would have been simpler with a sample of your input data. I created my own to make sure my solution works. You only have one mistake: once you group with ALL, your schema becomes group:chararray,countByClientId:bag{:tuple(group:chararray,:long)}

So $1 refers to the whole bag, whose tuples have two fields, and AVG cannot tell which field to average. That is why you can't compute the mean. To access the second field ($1) inside this bag you have two choices: either $1.$1 or countByClientId.$1. So your last line should be:

avg = foreach groupCount generate AVG(countByClientId.$1);

I hope it's clear.
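For reference, here is a minimal end-to-end sketch of the whole pipeline. The file name 'sales.csv', the comma delimiter, the amount column, and the aliases nbSales and avgSales are my own assumptions for illustration, since the question does not show the input; adapt them to your data.

```pig
-- Assumption: 'sales.csv' is a hypothetical comma-separated file, one sale per line,
-- with the client id in the first column and some amount in the second.
sales = LOAD 'sales.csv' USING PigStorage(',') AS (clientId:chararray, amount:double);

-- One group per client, then the number of sales in each group.
byClientId = GROUP sales BY clientId;
countByClientId = FOREACH byClientId GENERATE group AS clientId, COUNT(sales) AS nbSales;

-- Collapse everything into a single group so the counts can be averaged together.
groupCount = GROUP countByClientId ALL;

-- Project the count field out of the bag before passing it to AVG.
avgSales = FOREACH groupCount GENERATE AVG(countByClientId.nbSales);

DUMP avgSales;
```

Naming the count field (nbSales here) instead of using $1 makes the projection inside the bag easier to read, but countByClientId.$1 is equivalent.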

Upvotes: 1
