Justin
Justin

Reputation: 115

Pandas - Combining duplicate lines into one

So below I have an example of a DataFrame where throughout, there will be multiple instances like the SALES TAX EXPENSE lines, where if there is a duplicate, it will need to be converted into one line where the total of Trans_Amt should be Trans_Type C - Trans_Type D.

So for example in this DF, there should only be one line for SALES TAX EXPENSE, and the total should be -36239.65.

This happens multiple times throughout the DF, with multiple different ActNames. I'm looking for insight as to the most efficient way to do this transformation and apply it to any instance where this occurs.

Thank you!

              ActName  ActCode Trans_Type   Trans_Amt
0               SALES      401          C  2082748.85
1   SALES TAX EXPENSE      407          C   100000.00
30          DISCOUNTS      405          D     -654.59
31  SALES TAX EXPENSE      407          D   136239.65

Upvotes: 1

Views: 41

Answers (1)

Vaishali
Vaishali

Reputation: 38415

Group data by columns and take assign difference to Amt. Then drop duplicates.

df['Trans_Amt'] = df.groupby(['ActName','ActCode']).Trans_Amt.apply(lambda x: x.diff(periods=-1)).combine_first(df['Trans_Amt'])
df.drop_duplicates('ActName')

    ActName             ActCode Trans_Type  Trans_Amt
0   SALES               401     C           2082748.85
1   SALES TAX EXPENSE   407     C           -36239.65
30  DISCOUNTS           405     D           -654.59

Edit: Based on follow-up question. If the difference should be with the previous row, try

df['Trans_Amt'] = df.groupby(['ActName','ActCode']).Trans_Amt.apply(lambda x: x.diff()).combine_first(df['Trans_Amt'])
df.drop_duplicates('ActName', keep='last')

    ActName             ActCode Trans_Type  Trans_Amt
0   SALES               401     C           2082748.85
30  DISCOUNTS           405     D           -654.59
31  SALES TAX EXPENSE   407     D           36239.65

Upvotes: 2

Related Questions