Reputation: 173
I have data rows that look like this:
user startTimestamp endTimestamp durationForeground application
11 1409239321000 1409239395000 1 IEXPLORE.EXE
11 1409239322000 0 19 IEXPLORE.EXE
11 1409239341000 0 18 IEXPLORE.EXE
11 1409239359000 0 0 IEXPLORE.EXE
11 1409239359000 0 7 IEXPLORE.EXE
11 1409239366000 0 6 IEXPLORE.EXE
11 1409239372000 0 10 IEXPLORE.EXE
11 1409239382000 0 13 IEXPLORE.EXE
11 1409239395000 1409239446000 9 MSPAINT.EXE
11 1409239404000 0 4 MSPAINT.EXE
11 1409239408000 0 13 MSPAINT.EXE
11 1409239421000 0 12 MSPAINT.EXE
11 1409239433000 0 5 MSPAINT.EXE
11 1409239438000 0 8 MSPAINT.EXE
I want to be able to sum all the durationForegrounds for each little group; where a group starts with a row that has an endTimestamp and finishes with the row just before the next start.
(The reason for this is that the difference between endTimestamp and startTimestamp give us the running time of the app, and the sum of durationForeground gives us the time the app was in the foreground.)
Can this be done with Pig?
Upvotes: 1
Views: 285
Reputation: 4724
One option could be you need to group your data by user
and application
and get the sum of durationForeground
.
Sample example
input
11 1409239321000 1409239395000 1 IEXPLORE.EXE
11 1409239322000 0 19 IEXPLORE.EXE
11 1409239341000 0 18 IEXPLORE.EXE
11 1409239359000 0 0 IEXPLORE.EXE
11 1409239359000 0 7 IEXPLORE.EXE
11 1409239366000 0 6 IEXPLORE.EXE
11 1409239372000 0 10 IEXPLORE.EXE
11 1409239382000 0 13 IEXPLORE.EXE
11 1409239395000 1409239446000 9 MSPAINT.EXE
11 1409239404000 0 4 MSPAINT.EXE
11 1409239408000 0 13 MSPAINT.EXE
11 1409239421000 0 12 MSPAINT.EXE
11 1409239433000 0 5 MSPAINT.EXE
11 1409239438000 0 8 MSPAINT.EXE
PigScript:
A = LOAD 'input' USING PigStorage() AS(user:int,startTimestamp:long,endTimestamp:long,durationForeground:long,application:chararray);
B = GROUP A BY (user,application);
C = FOREACH B GENERATE FLATTEN(group),SUM(A.durationForeground);
DUMP C;
Output:
(11,MSPAINT.EXE,51)
(11,IEXPLORE.EXE,74)
In the above approach i assumed all the input fields are delimited by tab(\t).
Upvotes: 2