the phoenix
the phoenix

Reputation: 739

Sum rows of grouped data frame based on a specific column

I have one data frame where I would like to create new column from the sum of different rows within one group df["NEW_Salary"]. If grouped df by the column Year, month & day , I want to for each group to sum the rows where Combination is True to the rows where Combination is False.

    import pandas as pd
    
 data = {"Year":[2002,2002,2002,2002,2002,2010,2010,2010,2010,2010],
        "Name":['Jason','Tom','KimJason','KimTom','Kim','Jason','Tom','KimJason','KimTom','Kim'],
        "Combination":[False,False,True,True,False,False,False,True,True,False],
        "Salary":[10,20,25,25,30,20,30,35,35,40]
        }
    df=pd.dataframe(data)

 Year  Month  Day        Name     Combination  Salary
0   2002      1   15     Jason        False      10
1   2002      1   15       Tom        False      20
2   2002      1   15  KimJason         True      25
3   2002      1   15    KimTom         True      25
4   2002      1   15       Kim        False      30
5   2010      3   20     Jason        False      20
6   2010      3   20       Tom        False      30
7   2010      3   20  KimJason         True      35
8   2010      3   20    KimTom         True      35
9   2010      3   20       Kim        False      40
10  2002      4    5      Mary        False      10
11  2002      4    5   MaryTom         True      20
12  2002      4    5       Tom        False      30

df["New_Salary"] would be created as following:

  1. The row where Name is KimJason,Salary would be added to the Salary rows where Name is Kim & Jason
  2. The row where Name is KimTom, Salary would be added again to the Salary rows where Name is Kim& Tom
  3. The rows of KimTom & KimJason would be the same in the new column NEW_Salary as in Salary

The expected output:

       Year  Month  Day   Name  Combination  Salary    NEW_Salary
0   2002      1   15     Jason        False      10          35
1   2002      1   15       Tom        False      20          45
2   2002      1   15  KimJason         True      25          25
3   2002      1   15    KimTom         True      25          25
4   2002      1   15       Kim        False      30          80
5   2010      3   20     Jason        False      20          55
6   2010      3   20       Tom        False      30          65
7   2010      3   20  KimJason         True      35          35
8   2010      3   20    KimTom         True      35          35
9   2010      3   20       Kim        False      40         110
10  2002      4    5      Mary        False      10          30
11  2002      4    5   MaryTom         True      20          20
12  2002      4    5       Tom        False      30          50

Is there an easy way to achieve this output? no matter how many groups I have ?

Upvotes: 1

Views: 70

Answers (1)

Rabinzel
Rabinzel

Reputation: 7913

Here`s how you can do it, as far I can tell, it should also work for any groups.

First extract all rows with combination names to a dictionary

lookup = dict(tuple(df.loc[df['Combination']==True].groupby('Name')[['Year', 'Salary']]))
for key,value in lookup.items():
    print(f"{key=}:\n{value}")

which looks like this: (value to each key is a df)

key='KimJason':
   Year  Salary
2  2002      25
7  2010      35
key='KimTom':
   Year  Salary
3  2002      25
8  2010      35

Then filter for rows where Combination is False, apply row by row the value of Salary and add all values which are found for that name and year in the lookup dictionary. At the end update the df with the new Salary values.

s = (df.loc[df['Combination']==False]
     .apply(lambda row: 
            row['Salary'] + sum(lookup[x].loc[lookup[x]['Year']==row['Year'], 'Salary'].squeeze()
                                for x in lookup 
                                if row['Name'] in x)
            ,axis=1)
    )
df['Salary'].update(s)
print(df)

Output df:

   Year      Name  Combination  Salary
0  2002     Jason        False      35
1  2002       Tom        False      45
2  2002  KimJason         True      25
3  2002    KimTom         True      25
4  2002       Kim        False      80
5  2010     Jason        False      55
6  2010       Tom        False      65
7  2010  KimJason         True      35
8  2010    KimTom         True      35
9  2010       Kim        False     110

Upvotes: 1

Related Questions