Reputation: 13
I have a data frame df:
Owner Manager Date Hours City
John Jerry 1/2/16 10 LA
John Jerry 1/2/16 10 SF
Mary Jerry 1/2/16 9 LA
Zach Joe 1/3/16 5 SD
Wendy Joe 1/3/16 4 SF
Hal Joe 1/4/16 2 SD
... 100,000 entries
I would like to group by 'Manager' and 'Date', then select unique values of 'Owner' and sum 'Hours' of that selection, finally transforming the sum to a new column 'Hours_by_Manager'.
My desired output is:
Owner Manager Date Hours City Hours_by_Manager
John Jerry 1/2/16 10 LA 19
John Jerry 1/2/16 10 SF 19
Mary Jerry 1/2/16 9 LA 19
Zach Joe 1/3/16 5 SD 9
Wendy Joe 1/3/16 4 SF 9
Hal Joe 1/4/16 2 SD 2
I tried using pandas 'groupby' like this:
df['Hours_by_Manager']=df.groupby(['Manager','Date'])['Hours'].transform(lambda x: sum(x.unique()))
This gives me what I want, but only by coincidence: within each group every Owner happens to have a different value of 'Hours', so unique() keeps one value per Owner. If two Owners in the same group worked the same number of hours, unique() would incorrectly collapse them into one. What I'm looking for is something like this:
df['Hours_by_Manager']=df.groupby(['Manager','Date'])['Owner'].unique()['Hours']transform(lambda x: sum(x))
Which obviously is not syntactically correct. I know I could use for loops, but I would like to keep things vectorized. Any suggestions?
Upvotes: 0
Views: 728
Reputation: 880299
import pandas as pd
df = pd.DataFrame({'City': ['LA', 'SF', 'LA', 'SD', 'SF', 'SD'],
                   'Date': ['1/2/16', '1/2/16', '1/2/16', '1/3/16', '1/3/16', '1/4/16'],
                   'Hours': [10, 10, 9, 5, 4, 2],
                   'Manager': ['Jerry', 'Jerry', 'Jerry', 'Joe', 'Joe', 'Joe'],
                   'Owner': ['John', 'John', 'Mary', 'Zach', 'Wendy', 'Hal']})
uniques = df.drop_duplicates(subset=['Hours','Owner','Date'])
hours = uniques.groupby(['Manager', 'Date'])['Hours'].sum().reset_index()
hours = hours.rename(columns={'Hours':'Hours_by_Manager'})
result = pd.merge(df, hours, how='left')
print(result)
yields
City Date Hours Manager Owner Hours_by_Manager
0 LA 1/2/16 10 Jerry John 19
1 SF 1/2/16 10 Jerry John 19
2 LA 1/2/16 9 Jerry Mary 19
3 SD 1/3/16 5 Joe Zach 9
4 SF 1/3/16 4 Joe Wendy 9
5 SD 1/4/16 2 Joe Hal 2
Explanation:
An Owner on a given Date works a unique number of Hours. So let's first create a table of unique ['Hours', 'Owner', 'Date'] rows:
uniques = df.drop_duplicates(subset=['Hours','Owner','Date'])
# alternatively, uniques = df.groupby(['Hours','Owner','Date']).first().reset_index()
# City Date Hours Manager Owner
# 0 LA 1/2/16 10 Jerry John
# 2 LA 1/2/16 9 Jerry Mary
# 3 SD 1/3/16 5 Joe Zach
# 4 SF 1/3/16 4 Joe Wendy
# 5 SD 1/4/16 2 Joe Hal
Now we can group by ['Manager', 'Date'] and sum the Hours:
hours = uniques.groupby(['Manager', 'Date'])['Hours'].sum().reset_index()
Manager Date Hours
0 Jerry 1/2/16 19
1 Joe 1/3/16 9
2 Joe 1/4/16 2
The hours['Hours'] column contains the values we want in df['Hours_by_Manager'].
hours = hours.rename(columns={'Hours':'Hours_by_Manager'})
So now we can merge df and hours to obtain the desired result. With no on keyword, pd.merge joins on the columns the two frames share, here ['Manager', 'Date']:
result = pd.merge(df, hours, how='left')
# City Date Hours Manager Owner Hours_by_Manager
# 0 LA 1/2/16 10 Jerry John 19
# 1 SF 1/2/16 10 Jerry John 19
# 2 LA 1/2/16 9 Jerry Mary 19
# 3 SD 1/3/16 5 Joe Zach 9
# 4 SF 1/3/16 4 Joe Wendy 9
# 5 SD 1/4/16 2 Joe Hal 2
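As an aside (not part of the original answer), the explicit merge can also be written with DataFrame.join, which aligns the grouped sums back onto df by the two key columns against the MultiIndex of the summed Series. A sketch using the same df as above:

```python
import pandas as pd

df = pd.DataFrame({'City': ['LA', 'SF', 'LA', 'SD', 'SF', 'SD'],
                   'Date': ['1/2/16', '1/2/16', '1/2/16', '1/3/16', '1/3/16', '1/4/16'],
                   'Hours': [10, 10, 9, 5, 4, 2],
                   'Manager': ['Jerry', 'Jerry', 'Jerry', 'Joe', 'Joe', 'Joe'],
                   'Owner': ['John', 'John', 'Mary', 'Zach', 'Wendy', 'Hal']})

# Sum hours per (Manager, Date) over the deduplicated rows, then name the
# Series so it becomes the new column when joined back onto df.
sums = (df.drop_duplicates(subset=['Hours', 'Owner', 'Date'])
          .groupby(['Manager', 'Date'])['Hours'].sum()
          .rename('Hours_by_Manager'))

# join matches df's ['Manager', 'Date'] columns against sums' MultiIndex.
result = df.join(sums, on=['Manager', 'Date'])
print(result)
```

This produces the same Hours_by_Manager column (19, 19, 19, 9, 9, 2) without building and renaming an intermediate frame.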
Upvotes: 1