Reputation: 47
I'm looking for a way to drop duplicate rows based on a certain column subset, but merge some of the data so it does not get removed.
import pandas as pd
# Example Dataframe
data = {
"Parcel": ['001', '002', '003', '003'],
"Res" : ['Henry', 'Nick', 'Paul', 'Bill'],
"Bill" : ['4,100', '2,300', '5,200', '4,000'],
"Year" : ['1995', '1990', '2008', '2008']
}
df = pd.DataFrame.from_dict(data=data)
Parcel | Res | Bill | Year |
---|---|---|---|
001 | Henry | 4,100 | 1995 |
002 | Nick | 2,300 | 1990 |
003 | Paul | 5,200 | 2008 |
003 | Bill | 4,000 | 2008 |
Some pseudocode for what I want would look something like this:
df = df.drop_duplicates(subset='Parcel', keep_data=['Res', 'Bill'])
Parcel | Res | Bill | Year |
---|---|---|---|
001 | Henry | 4,100 | 1995 |
002 | Nick | 2,300 | 1990 |
003 | Paul, Bill | 5,200, 4,000 | 2008 |
I'm not sure where to begin with this, but any tips as to where to look would be appreciated.
Upvotes: 3
Views: 1402
Reputation: 195593
You can use .groupby with .agg:
df = (
df.groupby("Parcel")
.agg({"Res": ", ".join, "Bill": ", ".join, "Year": "first"})
.reset_index()
)
print(df)
Prints:
Parcel Res Bill Year
0 001 Henry 4,100 1995
1 002 Nick 2,300 1990
2 003 Paul, Bill 5,200, 4,000 2008
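If you would rather keep the merged values as Python lists instead of comma-separated strings, the same groupby works with list as the aggregation function. A minimal variation on the code above, reusing the df from the question (the variable name out is just for illustration):
# collect duplicated values into lists rather than joining them into one string
out = (
    df.groupby("Parcel")
      .agg({"Res": list, "Bill": list, "Year": "first"})
      .reset_index()
)
print(out)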
EDIT: If you have many columns, you can aggregate all values by "first" and then update the dataframe:
g = df.groupby("Parcel")
x = g.agg("first")
x.update(g.agg({"Res": ", ".join, "Bill": ", ".join}))
print(x.reset_index())
Parcel Res Bill Year
0 001 Henry 4,100 1995
1 002 Nick 2,300 1990
2 003 Paul, Bill 5,200, 4,000 2008
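If you know up front which columns need joining, you can also skip the separate update step by building the aggregation dict programmatically. A sketch, assuming only "Res" and "Bill" should be joined and every other column can take its first value:
# build the agg spec once: join the chosen columns, take "first" for the rest
join_cols = ["Res", "Bill"]
agg_spec = {
    c: (", ".join if c in join_cols else "first")
    for c in df.columns
    if c != "Parcel"
}
print(df.groupby("Parcel").agg(agg_spec).reset_index())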
Upvotes: 4