Reputation: 384
Assuming my data has the following structure :
Year | Location | New_client
2018 | Paris | true
2018 | Paris | true
2018 | Paris | false
2018 | London | true
2018 | Madrid | true
2018 | Madrid | false
2017 | Paris | true
I'm trying to calculate for each year and location the percentage of true value for New_client, so an example taking the records from the structure example would be
2018 | Paris | 66
2018 | London | 100
2018 | Madrid | 50
2017 | Paris | 100
Adapting from https://stackoverflow.com/a/13484279/2802552 my current script is but the difference is that instead of 1 column it's using 2 columns (Year and Location)
data = load...
grp = group inpt by Year; -- creates bags for each value in col1 (Year)
result = FOREACH grp {
total = COUNT(data);
t = FILTER data BY New_client == 'true'; --create a bag which contains only T values
GENERATE FLATTEN(group) AS Year, total AS TOTAL_ROWS_IN_INPUT_TABLE, 100*(double)COUNT(t)/(double)total AS PERCENTAGE_TRUE_IN_INPUT_TABLE;
};
The problem is this uses Year as reference while I need it to be Year AND District.
Thanks for your help.
Upvotes: 0
Views: 1014
Reputation: 3454
Tested this code and looks working for me:
A = LOAD ...
B = GROUP A BY (year, location);
C = FOREACH B {
TRUE_CNT = FILTER A BY (chararray)new_client == 'true';
GENERATE group.year, group.location, (int)((float)COUNT(TRUE_CNT) / COUNT(A) * 100);
}
DUMP C;
(2017,Paris,100)
(2018,Paris,66)
(2018,London,100)
(2018,Madrid,50)
Upvotes: 0
Reputation: 722
You need to group by both Year
and Location
, which will require two modifications. First, add Location
to the group by statement. Second, change FLATTEN(group) AS Year
to FLATTEN(group) AS (Year, Location)
since group
is now a tuple with two fields.
grp = group inpt by (Year, Location);
result = FOREACH grp {
total = COUNT(inpt);
t = FILTER inpt BY New_client == 'true';
GENERATE
FLATTEN(group) AS (Year, Location),
total AS TOTAL_ROWS_IN_INPUT_TABLE,
100*(double)COUNT(t)/(double)total AS PERCENTAGE_TRUE_IN_INPUT_TABLE;
};
Upvotes: 2