King Linux

Reputation: 11

SUM function in Pig script

I am a student learning Pig scripting on the Hortonworks sandbox. My problem is that I can't get the SUM function to work. I have successfully separated the fields of a firewall log, and I can run several queries and use the COUNT function, but I have had no luck with SUM, which I really need in one case. This is the code I used:

A = FOREACH logs_base GENERATE device_id,src,src_port,dst,dst_port,tran_ip,tran_port,service,duration,sent,rcvd,sent_pkt,rcvd_pkt,SN,user,group1, REGEX_EXTRACT(date, '\\d{3}-(\\d{2})-\\d{2}', 1) AS(month:chararray);
F1 = FILTER A BY user == 'PR11MS1120' and month == '10';
grpd1 = group F1 by user;
counter = foreach grpd1 {
    sum1 = SUM(A.rcvd);
    sum2 = SUM(A.sent);
    generate sum1, sum2;
};
dump counter;
C = foreach F1 generate rcvd, sent;
dump C;

When I dump just the relation C, I get many records showing the amount of data received/sent for the filter applied, e.g.:

(223,123)
(334,444)
(21,12344)
(...,...)

All I really want to do is add all those records together and show the total amount received and sent: (?,?).

Note: I have tried changing the variable type to int, long, and chararray with no success either.

Some of the errors I am getting while trying to solve this are:

Could not infer the matching function for org.apache.pig.builtin.SUM as multiple or none of them fit. Please use an explicit cast.

Upvotes: 1

Views: 9678

Answers (3)

user3558609

Reputation: 1

Please try the following

A = FOREACH logs_base GENERATE device_id,src,src_port,dst,dst_port,tran_ip,tran_port,service,duration,sent,rcvd,sent_pkt,rcvd_pkt,SN,user,group1, REGEX_EXTRACT(date, '\\d{3}-(\\d{2})-\\d{2}', 1) AS(month:chararray);
F1 = FILTER A BY user == 'PR11MS1120' and month == '10';
grpd1 = group F1 by user;
C = foreach grpd1 generate group, SUM(F1.rcvd), SUM(F1.sent);
dump C;

Upvotes: 0

dimzak

Reputation: 2571

A lucky guess here, I'm new to Pig too :)
I'm not sure SUM can be applied to a chararray (that would explain the error), so make rcvd and sent type int and then generate the two sums over the grpd1 bag:

F1 = FILTER A BY user == 'PR11MS1120' and month == '10';
grpd1 = group F1 by user;
C1 = foreach grpd1 generate SUM(F1.rcvd);
dump C1;
C2 = foreach grpd1 generate SUM(F1.sent);
dump C2;

NOTE: More info here.

Hope I helped a little!
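
The answer above suggests making rcvd and sent numeric without showing how. A minimal sketch of one way to do that, assuming the two fields currently come through as chararray or untyped bytearray (the question does not show the LOAD schema), is to cast them explicitly in a projection before grouping. The alias F1_typed is made up for illustration.

```pig
-- Assumption: rcvd and sent are chararray or bytearray at this point.
-- F1_typed is a hypothetical alias, not part of the original script.
F1_typed = FOREACH F1 GENERATE user, (long)rcvd AS rcvd, (long)sent AS sent;

-- Check that both fields now report as long before grouping.
DESCRIBE F1_typed;
```

Grouping and summing would then reference F1_typed instead of F1. Using long rather than int is a judgment call to leave headroom for large byte counts.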

Upvotes: 1

Parag

Reputation: 175

First, make sure that the fields you are summing are of type int.

Use DESCRIBE A; to check the data types. After that, note that you applied a filter and then grouped F1 by user:

F1 = FILTER A BY user == 'PR11MS1120' and month == '10';
grpd1 = group F1 by user;

So when summing, you should project from F1 instead of A:

counter = foreach grpd1 {
    sum1 = SUM(F1.rcvd);
    sum2 = SUM(F1.sent);
    generate sum1, sum2;
};

Use DESCRIBE grpd1; and you will see what I mean: there is no 'A' in its schema. I guess this should solve the error. Finally, check the logic of what you want in the result; I have not checked that. Hope this helps. PS: I am also a student and new to Pig.
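
To make the point about the bag name concrete, here is a sketch that assumes rcvd and sent are already numeric in F1. After GROUP, each record of grpd1 holds the grouping key in a field named group and a bag named after the grouped relation, F1, so there is no field named A to project from. The aliases total_rcvd and total_sent are illustrative.

```pig
grpd1 = GROUP F1 BY user;

-- Shows the schema: a 'group' field plus a bag named F1.
DESCRIBE grpd1;

-- One output tuple per user, holding both totals.
counter = FOREACH grpd1 GENERATE
    group AS user,
    SUM(F1.rcvd) AS total_rcvd,
    SUM(F1.sent) AS total_sent;
DUMP counter;
```

Because the filter keeps a single user, this should yield one tuple with the two totals the question asks for.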

Upvotes: 1
