Reputation: 47
I'm looking for a way to drop duplicate rows based on a certain column subset, but merge some of the data so it does not get removed.
import pandas as pd
# Example Dataframe
data = {
"Parcel": ['001', '002', '003', '003'],
"Res" : ['Henry', 'Nick', 'Paul', 'Bill'],
"Bill" : ['4,100', '2,300', '5,200', '4,000'],
"Year" : ['1995', '1990', '2008', '2008']
}
df = pd.DataFrame.from_dict(data=data)
Parcel | Res | Bill | Year |
---|---|---|---|
001 | Henry | 4,100 | 1995 |
002 | Nick | 2,300 | 1990 |
003 | Paul | 5,200 | 2008 |
003 | Bill | 4,000 | 2008 |
Some pseudocode for what I want would look something like this:
df = df.drop_duplicates(subset='Parcel', keep_data=['Res', 'Bill'])
Parcel | Res | Bill | Year |
---|---|---|---|
001 | Henry | 4,100 | 1995 |
002 | Nick | 2,300 | 1990 |
003 | Paul, Bill | 5,200, 4,000 | 2008 |
I'm not sure where to begin with this, but any tips as to where to look would be appreciated.
Upvotes: 3
Views: 1402
Reputation: 195593
You can use .groupby with .agg:
df = (
df.groupby("Parcel")
.agg({"Res": ", ".join, "Bill": ", ".join, "Year": "first"})
.reset_index()
)
print(df)
Prints:
Parcel Res Bill Year
0 001 Henry 4,100 1995
1 002 Nick 2,300 1990
2 003 Paul, Bill 5,200, 4,000 2008
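If you would rather keep the merged values as Python lists instead of comma-separated strings, the same groupby works with list as the aggregation function. A minimal variation on the code above, reusing the df from the question (the variable name out is just for illustration):
# collect duplicated values into lists rather than joining them into one string
out = (
    df.groupby("Parcel")
      .agg({"Res": list, "Bill": list, "Year": "first"})
      .reset_index()
)
print(out)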
EDIT: If you have many columns, you can aggregate all values by "first" and then update the dataframe:
g = df.groupby("Parcel")
x = g.agg("first")
x.update(g.agg({"Res": ", ".join, "Bill": ", ".join}))
print(x.reset_index())
Parcel Res Bill Year
0 001 Henry 4,100 1995
1 002 Nick 2,300 1990
2 003 Paul, Bill 5,200, 4,000 2008
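If you know up front which columns need joining, you can also skip the separate update step by building the aggregation dict programmatically. A sketch, assuming only "Res" and "Bill" should be joined and every other column can take its first value:
# build the agg spec once: join the chosen columns, take "first" for the rest
join_cols = ["Res", "Bill"]
agg_spec = {
    c: (", ".join if c in join_cols else "first")
    for c in df.columns
    if c != "Parcel"
}
print(df.groupby("Parcel").agg(agg_spec).reset_index())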
Upvotes: 4