Carlos Andres Castro
Carlos Andres Castro

Reputation: 173

Summing across groups in Pig Latin

I have data rows that look like this:

user    startTimestamp endTimestamp   durationForeground    application
11  1409239321000  1409239395000  1     IEXPLORE.EXE
11  1409239322000  0    19      IEXPLORE.EXE
11  1409239341000  0    18      IEXPLORE.EXE
11  1409239359000  0    0       IEXPLORE.EXE
11  1409239359000  0    7       IEXPLORE.EXE
11  1409239366000  0    6       IEXPLORE.EXE
11  1409239372000  0    10      IEXPLORE.EXE
11  1409239382000  0    13      IEXPLORE.EXE
11  1409239395000  1409239446000  9     MSPAINT.EXE
11  1409239404000  0    4       MSPAINT.EXE
11  1409239408000  0    13      MSPAINT.EXE
11  1409239421000  0    12      MSPAINT.EXE
11  1409239433000  0    5       MSPAINT.EXE
11  1409239438000  0    8       MSPAINT.EXE

I want to be able to sum all the durationForegrounds for each little group; where a group starts with a row that has an endTimestamp and finishes with the row just before the next start.

(The reason for this is that the difference between endTimestamp and startTimestamp give us the running time of the app, and the sum of durationForeground gives us the time the app was in the foreground.)

Can this be done with Pig?

Upvotes: 1

Views: 285

Answers (1)

Sivasakthi Jayaraman
Sivasakthi Jayaraman

Reputation: 4724

One option could be you need to group your data by user and application and get the sum of durationForeground.

Sample example

input

11      1409239321000   1409239395000   1       IEXPLORE.EXE
11      1409239322000   0       19      IEXPLORE.EXE
11      1409239341000   0       18      IEXPLORE.EXE
11      1409239359000   0       0       IEXPLORE.EXE
11      1409239359000   0       7       IEXPLORE.EXE
11      1409239366000   0       6       IEXPLORE.EXE
11      1409239372000   0       10      IEXPLORE.EXE
11      1409239382000   0       13      IEXPLORE.EXE
11      1409239395000   1409239446000   9       MSPAINT.EXE
11      1409239404000   0       4       MSPAINT.EXE
11      1409239408000   0       13      MSPAINT.EXE
11      1409239421000   0       12      MSPAINT.EXE
11      1409239433000   0       5       MSPAINT.EXE
11      1409239438000   0       8       MSPAINT.EXE

PigScript:

A = LOAD 'input' USING PigStorage() AS(user:int,startTimestamp:long,endTimestamp:long,durationForeground:long,application:chararray);
B = GROUP A BY (user,application);
C = FOREACH  B GENERATE FLATTEN(group),SUM(A.durationForeground);
DUMP C;

Output:

(11,MSPAINT.EXE,51)
(11,IEXPLORE.EXE,74)

In the above approach i assumed all the input fields are delimited by tab(\t).

Upvotes: 2

Related Questions