adaba
adaba

Reputation: 384

Calculate percentage on boolean column

Assuming my data has the following structure :

Year      | Location | New_client 

2018      | Paris    | true
2018      | Paris    | true
2018      | Paris    | false
2018      | London   | true
2018      | Madrid   | true
2018      | Madrid   | false
2017      | Paris    | true

I'm trying to calculate for each year and location the percentage of true value for New_client, so an example taking the records from the structure example would be

2018     | Paris    | 66
2018     | London   | 100
2018     | Madrid   | 50
2017     | Paris    | 100

Adapting from https://stackoverflow.com/a/13484279/2802552 my current script is but the difference is that instead of 1 column it's using 2 columns (Year and Location)

data = load...
grp = group inpt by Year; -- creates bags for each value in col1 (Year)
result = FOREACH grp {
    total = COUNT(data);
    t = FILTER data BY New_client == 'true'; --create a bag which contains only T values
    GENERATE FLATTEN(group) AS Year, total AS TOTAL_ROWS_IN_INPUT_TABLE, 100*(double)COUNT(t)/(double)total AS PERCENTAGE_TRUE_IN_INPUT_TABLE;
};

The problem is this uses Year as reference while I need it to be Year AND District.

Thanks for your help.

Upvotes: 0

Views: 1014

Answers (2)

Nazar Merza
Nazar Merza

Reputation: 3454

Tested this code and looks working for me:

A = LOAD ...
B = GROUP A BY (year, location);
C = FOREACH B  {
    TRUE_CNT = FILTER A BY (chararray)new_client == 'true';
    GENERATE group.year, group.location, (int)((float)COUNT(TRUE_CNT) / COUNT(A) * 100);
}

DUMP C;
(2017,Paris,100)
(2018,Paris,66)
(2018,London,100)
(2018,Madrid,50)

Upvotes: 0

savagedata
savagedata

Reputation: 722

You need to group by both Year and Location, which will require two modifications. First, add Location to the group by statement. Second, change FLATTEN(group) AS Year to FLATTEN(group) AS (Year, Location) since group is now a tuple with two fields.

grp = group inpt by (Year, Location);
result = FOREACH grp {
    total = COUNT(inpt);
    t = FILTER inpt BY New_client == 'true';
    GENERATE 
        FLATTEN(group) AS (Year, Location), 
        total AS TOTAL_ROWS_IN_INPUT_TABLE, 
        100*(double)COUNT(t)/(double)total AS PERCENTAGE_TRUE_IN_INPUT_TABLE;
};

Upvotes: 2

Related Questions