andrebo7

Reputation: 13

Pandas groupby multiple keys selecting unique values and transforming

I have a data frame df:

Owner     Manager     Date      Hours  City
 John      Jerry       1/2/16     10    LA
 John      Jerry       1/2/16     10    SF
 Mary      Jerry       1/2/16     9     LA
 Zach      Joe         1/3/16     5     SD
 Wendy     Joe         1/3/16     4     SF
 Hal       Joe         1/4/16     2     SD

... 100,000 entries

I would like to group by 'Manager' and 'Date', then select unique values of 'Owner' and sum 'Hours' of that selection, finally transforming the sum to a new column 'Hours_by_Manager'.

My desired output is:

Owner     Manager     Date      Hours  City   Hours_by_Manager
   John      Jerry       1/2/16     10    LA   19
   John      Jerry       1/2/16     10    SF   19
   Mary      Jerry       1/2/16     9     LA   19
   Zach      Joe         1/3/16     5     SD   9
   Wendy     Joe         1/3/16     4     SF   9
   Hal       Joe         1/4/16     2     SD   2

I tried using pandas 'groupby' like this:

df['Hours_by_Manager']=df.groupby(['Manager','Date'])['Hours'].transform(lambda x: sum(x.unique()))

Which gives me what I want, but only because the Hours values happen to differ between Owners. What I'm looking for is something like this:

df['Hours_by_Manager'] = df.groupby(['Manager','Date'])['Owner'].unique()['Hours'].transform(lambda x: sum(x))

which obviously is not syntactically correct. I know I could use for loops, but I would like to keep things vectorized. Any suggestions?
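To make the failure mode concrete: if two different Owners log the same Hours under the same Manager and Date, sum(x.unique()) collapses them into one value. A minimal sketch with hypothetical data:

```python
import pandas as pd

# Two different owners with identical hours under the same manager/date.
df = pd.DataFrame({'Manager': ['Jerry', 'Jerry'],
                   'Date': ['1/2/16', '1/2/16'],
                   'Owner': ['John', 'Mary'],
                   'Hours': [10, 10]})

# unique() drops Mary's 10 because it matches John's 10.
out = df.groupby(['Manager', 'Date'])['Hours'].transform(lambda x: sum(x.unique()))
print(out.tolist())  # [10, 10] -- but the correct total is 20
```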

Upvotes: 0

Views: 728

Answers (1)

unutbu

Reputation: 880299

import pandas as pd
df = pd.DataFrame({'City': ['LA', 'SF', 'LA', 'SD', 'SF', 'SD'],
    'Date': ['1/2/16', '1/2/16', '1/2/16', '1/3/16', '1/3/16', '1/4/16'],
    'Hours': [10, 10, 9, 5, 4, 2],
    'Manager': ['Jerry', 'Jerry', 'Jerry', 'Joe', 'Joe', 'Joe'],
    'Owner': ['John', 'John', 'Mary', 'Zach', 'Wendy', 'Hal']})

uniques = df.drop_duplicates(subset=['Hours','Owner','Date'])
hours = uniques.groupby(['Manager', 'Date'])['Hours'].sum().reset_index()
hours = hours.rename(columns={'Hours':'Hours_by_Manager'})
result = pd.merge(df, hours, how='left')
print(result)

yields

  City    Date  Hours Manager  Owner  Hours_by_Manager
0   LA  1/2/16     10   Jerry   John                19
1   SF  1/2/16     10   Jerry   John                19
2   LA  1/2/16      9   Jerry   Mary                19
3   SD  1/3/16      5     Joe   Zach                 9
4   SF  1/3/16      4     Joe  Wendy                 9
5   SD  1/4/16      2     Joe    Hal                 2

Explanation:

Each Owner works one fixed number of Hours on a given Date, so duplicate ['Hours','Owner','Date'] rows represent the same work counted twice. Let's first build a table of unique ['Hours','Owner','Date'] rows:

uniques = df.drop_duplicates(subset=['Hours','Owner','Date'])
# alternatively, uniques = df.groupby(['Hours','Owner','Date']).first().reset_index()
#   City    Date  Hours Manager  Owner
# 0   LA  1/2/16     10   Jerry   John
# 2   LA  1/2/16      9   Jerry   Mary
# 3   SD  1/3/16      5     Joe   Zach
# 4   SF  1/3/16      4     Joe  Wendy
# 5   SD  1/4/16      2     Joe    Hal

Now we can group by ['Manager', 'Date'] and sum the Hours:

hours = uniques.groupby(['Manager', 'Date'])['Hours'].sum().reset_index()
#   Manager    Date  Hours
# 0   Jerry  1/2/16     19
# 1     Joe  1/3/16      9
# 2     Joe  1/4/16      2

The hours['Hours'] column contains the values we want in df['Hours_by_Manager'].

hours = hours.rename(columns={'Hours':'Hours_by_Manager'})

So now we can left-merge df with hours (pd.merge joins on their common columns, Manager and Date) to obtain the desired result:

result = pd.merge(df, hours, how='left')
#   City    Date  Hours Manager  Owner  Hours_by_Manager
# 0   LA  1/2/16     10   Jerry   John                19
# 1   SF  1/2/16     10   Jerry   John                19
# 2   LA  1/2/16      9   Jerry   Mary                19
# 3   SD  1/3/16      5     Joe   Zach                 9
# 4   SF  1/3/16      4     Joe  Wendy                 9
# 5   SD  1/4/16      2     Joe    Hal                 2
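As a side note, the merge can be avoided by mapping the per-group sums back through the row index. This is a sketch of an equivalent alternative, not part of the original answer, and it assumes a pandas version where Index.map accepts a Series (recent versions do):

```python
import pandas as pd

df = pd.DataFrame({'City': ['LA', 'SF', 'LA', 'SD', 'SF', 'SD'],
                   'Date': ['1/2/16', '1/2/16', '1/2/16', '1/3/16', '1/3/16', '1/4/16'],
                   'Hours': [10, 10, 9, 5, 4, 2],
                   'Manager': ['Jerry', 'Jerry', 'Jerry', 'Joe', 'Joe', 'Joe'],
                   'Owner': ['John', 'John', 'Mary', 'Zach', 'Wendy', 'Hal']})

# Same dedup-then-sum as above, but kept as a (Manager, Date)-indexed Series.
sums = (df.drop_duplicates(subset=['Hours', 'Owner', 'Date'])
          .groupby(['Manager', 'Date'])['Hours'].sum())

# Map each row's (Manager, Date) pair to its summed hours -- no merge needed.
df['Hours_by_Manager'] = df.set_index(['Manager', 'Date']).index.map(sums)
print(df['Hours_by_Manager'].tolist())  # [19, 19, 19, 9, 9, 2]
```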

Upvotes: 1
