H.Burns
H.Burns

Reputation: 419

How to find average of a column and average of subtraction of two columns in Pig?

I am new to scripting using Pig Latin. I am stuck to write a pig script which will find the average of a column value and also to find the average of the subtracted values between two columns.

I am reading the data from a csv file having starttime and endtime columns as below:

"starttime","endtime",
"23","46",
"32","49",
"54","59"

The code that I have tried so far is as below :

file = LOAD '/project/timestamp.csv' Using PigStorage(',') AS (st:int, et:int);
start_ts = FOREACH file GENERATE st;
grouped = group start_ts by st
ILLUSTRATE grouped

The ILLUSTRATE output I am getting is as below and I am not able to apply the AVG function.

------------------------------------------
-------------------------------------------------------------------------------------
| grouped     | group:int     | file:bag{:tuple(st:int,et:int)}                 | 
-------------------------------------------------------------------------------------
|             |               | {(, ), (, )}                                        | 
-------------------------------------------------------------------------------------

Can anybody please help me getting the average of the starttime which would be the result of (23 + 32 + 54)/3

And also some ideas on how to code the (endtime -starttime)/no. of records (i.e 3 in this case) would be of great help for me to get started.

Thanks.

Upvotes: 0

Views: 905

Answers (3)

nobody
nobody

Reputation: 11080

First ensure that you are loading the data correctly.Looks like you have double quotes i.e " around your data.Load the data as chararray,replace the double quotes and then cast it to int,finally apply the AVG function for the starttime.For the avg of endtime - starttime just subtract the 2 fields and apply AVG.

A = LOAD '/project/timestamp.csv' Using PigStorage(',') AS (st:chararray, et:chararray);
B = FOREACH A GENERATE (int)REPLACE(st,'\\"','') as st,(int)REPLACE(et,'\\"','') as et;
C = GROUP B ALL;
D = FOREACH C GENERATE AVG(B.st),AVG(B.et - B.st);

Upvotes: 1

H.Burns
H.Burns

Reputation: 419

Thanks to inquisitive_mind. My answer is majorly based on his answer with a little tweak. This is only for the average of one column.

 file = LOAD '/project/timestamp.csv' Using PigStorage(',') AS (st:chararray, et:chararray);
cols  = FOREACH file GENERATE (int)REPLACE(st, '"', '') as st, (int)REPLACE(et, '"', '') as et;

grp_cols = GROUP cols all;


 avg = FOREACH grp_cols GENERATE AVG(cols.st);

DUMP avg

Upvotes: 0

Vikas Madhusudana
Vikas Madhusudana

Reputation: 1482

Try this

file = LOAD '/project/timestamp.csv' Using PigStorage(',') AS (st:int, et:int);
grouped = group file by 1
AVG = foreach grouped generate AVG(file.st)

Upvotes: 0

Related Questions