Cloud Learner

Reputation: 141

How to remove duplicate entries but keep the first row's value for selected columns and the last row's value for others?

I'm creating charts in Periscope Data and using pandas to derive the results. I'm having difficulty removing duplicates from the results.

This is what our data looks like in the final DataFrame after the calculations:

vendor_ID date        opening purchase paid closing
B2345     01/01/2015  5       20       10   15
B2345     01/01/2015  15      50       20   45
B2345     02/01/2015  45      4        30   19

I want to remove the duplicate entries based on vendor_ID and date, but keep the first row's opening, keep the last row's closing, and sum purchase and paid. For 01/01/2015 that means opening 5 (first row), closing 45 (last row), purchase 20 + 50 = 70 and paid 10 + 20 = 30.

The expected result is:

vendor_ID date        opening purchase paid closing
B2345     01/01/2015  5       70       30   45
B2345     02/01/2015  45      4        30   19

I've tried the code below to remove the duplicates, but it gave a result different from what I want.

df.drop_duplicates(subset=["vendor_ID", "date"], keep="last", inplace=True)
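
Here's a minimal reproduction of what I get with that call, building the sample data above by hand (column names taken from the table):

import pandas as pd

# sample data copied from the table above
df = pd.DataFrame({
    "vendor_ID": ["B2345", "B2345", "B2345"],
    "date": ["01/01/2015", "01/01/2015", "02/01/2015"],
    "opening": [5, 15, 45],
    "purchase": [20, 50, 4],
    "paid": [10, 20, 30],
    "closing": [15, 45, 19],
})

# keeps only the last whole row per (vendor_ID, date), so the first row's
# opening is lost and purchase/paid are not summed
print (df.drop_duplicates(subset=["vendor_ID", "date"], keep="last"))
  vendor_ID        date  opening  purchase  paid  closing
1     B2345  01/01/2015       15        50    20       45
2     B2345  02/01/2015       45         4    30       19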

How do I write this so that the duplicates are removed while keeping the first and last values as in the example above?

Upvotes: 2

Views: 71

Answers (1)

jezrael

Reputation: 863226

Use GroupBy.agg with GroupBy.first, GroupBy.last and GroupBy.sum specified per column for the output:

Note: thanks @Erfan - if you need the column minimum and maximum instead of the first and last values, change the dict to {'opening':'min','purchase':'sum','paid':'sum', 'closing':'max'}


df1 = (df.groupby(["vendor_ID", "date"], as_index=False)
         .agg({'opening':'first','purchase':'sum','paid':'sum', 'closing':'last'}))
print (df1)
  vendor_ID        date  opening  purchase  paid  closing
0     B2345  01/01/2015        5        70    30       45
1     B2345  02/01/2015       45         4    30       19
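
For example, the min/max variant from the note above looks like this (on this sample data it happens to match the first/last result, because the rows are already in order):

df1 = (df.groupby(["vendor_ID", "date"], as_index=False)
         .agg({'opening':'min','purchase':'sum','paid':'sum', 'closing':'max'}))
print (df1)
  vendor_ID        date  opening  purchase  paid  closing
0     B2345  01/01/2015        5        70    30       45
1     B2345  02/01/2015       45         4    30       19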

Also, if you are not sure whether the datetimes are sorted:

df['date'] = pd.to_datetime(df['date'], dayfirst=True)
df = df.sort_values(["vendor_ID", "date"])

df1 = (df.groupby(["vendor_ID", "date"], as_index=False)
         .agg({'opening':'first','purchase':'sum','paid':'sum', 'closing':'last'}))
print (df1)
  vendor_ID       date  opening  purchase  paid  closing
0     B2345 2015-01-01        5        70    30       45
1     B2345 2015-01-02       45         4    30       19

You can also create the dictionary dynamically, summing all columns except the two key columns and the ones used for first and last:

df['date'] = pd.to_datetime(df['date'], dayfirst=True)
df = df.sort_values(["vendor_ID", "date"])

d = {'opening':'first', 'closing':'last'}
sum_cols = df.columns.difference(list(d.keys()) + ['vendor_ID','date'])

final_d = {**dict.fromkeys(sum_cols,'sum'), **d}
df1 = df.groupby(["vendor_ID", "date"], as_index=False).agg(final_d).reindex(df.columns,axis=1)
print (df1)
  vendor_ID       date  opening  purchase  paid  closing
0     B2345 2015-01-01        5        70    30       45
1     B2345 2015-01-02       45         4    30       19
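
If the chart needs the original DD/MM/YYYY strings rather than datetimes, one option (a sketch, assuming the dayfirst format used above) is to format the date back after aggregating:

df1['date'] = df1['date'].dt.strftime('%d/%m/%Y')
print (df1)
  vendor_ID        date  opening  purchase  paid  closing
0     B2345  01/01/2015        5        70    30       45
1     B2345  02/01/2015       45         4    30       19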

Upvotes: 1
