Reputation: 141
I'm creating charts in Periscope Data and using pandas to derive our results. I'm having difficulty removing duplicates from the results.
This is what our data looks like in the final dataframe after the calculations:
vendor_ID  date        opening  purchase  paid  closing
B2345      01/01/2015  5        20        10    15
B2345      01/01/2015  15       50        20    45
B2345      02/01/2015  45       4         30    19
I want to remove the duplicate entries based on vendor_ID and date, but keep the opening from the first row and the closing from the last row.
i.e. the expected result I want:
vendor_ID  date        opening  purchase  paid  closing
B2345      01/01/2015  5        70        30    45
B2345      02/01/2015  45       4         30    19
I've tried the code below to remove the duplicates, but it doesn't give the result I want.
df.drop_duplicates(subset=["vendor_ID", "date"], keep="last", inplace=True)
How do I write this so that the duplicates are removed while keeping the first and last values as shown in the example above?
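For reference, a minimal sketch that reproduces the sample data and my attempt (column names taken from the tables above; the comment notes why drop_duplicates alone is not enough):
import pandas as pd

# sample data from the tables above
df = pd.DataFrame({
    'vendor_ID': ['B2345', 'B2345', 'B2345'],
    'date': ['01/01/2015', '01/01/2015', '02/01/2015'],
    'opening': [5, 15, 45],
    'purchase': [20, 50, 4],
    'paid': [10, 20, 30],
    'closing': [15, 45, 19],
})

# keep="last" keeps only the last row of each duplicate pair,
# so the first row's opening is lost and purchase/paid are not summed
df.drop_duplicates(subset=["vendor_ID", "date"], keep="last", inplace=True)
print (df)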
Upvotes: 2
Views: 71
Reputation: 863226
Use GroupBy.agg with GroupBy.first, GroupBy.last and GroupBy.sum, specifying the aggregation for each output column:
Note (thanks @Erfan): if you need the minimum and maximum of a column instead of the first and last values, change the dict to {'opening':'min','purchase':'sum','paid':'sum', 'closing':'max'}; a sketch of that variant follows the output below.
df1 = (df.groupby(["vendor_ID", "date"], as_index=False)
.agg({'opening':'first','purchase':'sum','paid':'sum', 'closing':'last'}))
print (df1)
  vendor_ID        date  opening  purchase  paid  closing
0     B2345  01/01/2015        5        70    30       45
1     B2345  02/01/2015       45         4    30       19
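As a sketch of the min/max variant from the note above, only the dict passed to agg changes (with this sample data the result happens to match the first/last version):
df1 = (df.groupby(["vendor_ID", "date"], as_index=False)
         .agg({'opening':'min','purchase':'sum','paid':'sum', 'closing':'max'}))
print (df1)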
Also, if you are not sure the datetimes are sorted:
df['date'] = pd.to_datetime(df['date'], dayfirst=True)
df = df.sort_values(["vendor_ID", "date"])
df1 = (df.groupby(["vendor_ID", "date"], as_index=False)
.agg({'opening':'first','purchase':'sum','paid':'sum', 'closing':'last'}))
print (df1)
  vendor_ID       date  opening  purchase  paid  closing
0     B2345 2015-01-01        5        70    30       45
1     B2345 2015-01-02       45         4    30       19
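If you want the original dd/mm/yyyy strings back after the to_datetime conversion, a small follow-up sketch (assuming df1 from above):
df1['date'] = df1['date'].dt.strftime('%d/%m/%Y')
print (df1)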
You can also create the dictionary dynamically: sum all columns except the two grouping columns and those used for first and last:
df['date'] = pd.to_datetime(df['date'], dayfirst=True)
df = df.sort_values(["vendor_ID", "date"])

# columns aggregated with first/last
d = {'opening':'first', 'closing':'last'}
# every other non-grouping column is summed
sum_cols = df.columns.difference(list(d.keys()) + ['vendor_ID','date'])
final_d = {**dict.fromkeys(sum_cols,'sum'), **d}

df1 = (df.groupby(["vendor_ID", "date"], as_index=False)
         .agg(final_d)
         .reindex(df.columns, axis=1))
print (df1)
  vendor_ID       date  opening  purchase  paid  closing
0     B2345 2015-01-01        5        70    30       45
1     B2345 2015-01-02       45         4    30       19
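For clarity, the dynamically built aggregation dictionary from the snippet above resolves to (the summed columns come first because Index.difference returns them sorted):
print (final_d)
{'paid': 'sum', 'purchase': 'sum', 'opening': 'first', 'closing': 'last'}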
Upvotes: 1