Reputation: 655
I have a DataFrame that contains partially duplicated rows. Say I have a row A = ['name': john, 'age': 15, 'email': NaN, 'school': middle] and a row B = ['name': john, 'age': 15, 'email': [email protected], 'school': NaN]. The resulting rows for both A and B should be ['name': john, 'age': 15, 'email': [email protected], 'school': middle].
So far I have tried using iterrows() over the DataFrame and changing the values, but the changes are not saved. My code:
duplicated = df[df.duplicated(['name', 'age'], keep=False)].sort_values('name')
row_iterator = duplicated.iterrows()
_, last = row_iterator.__next__()
for k, row in row_iterator:
    if row['name'] == last['name']:
        for i in duplicated.columns:
            if row[i] == last[i]:
                continue
            if pd.isna(row[i]):
                row[i] = last[i]
            if pd.isna(last[i]):
                last[i] = row[i]
    last = row
df is the DataFrame that holds all the data. I select only the duplicate rows into duplicated, then iterate through it and try to make changes as I go. But the changes I make are lost in the end. What am I doing wrong?
Upvotes: 1
Views: 50
Reputation: 42886
Two ways we can solve your problem:
Method 1: using bfill, ffill and drop_duplicates:
df = df.bfill().ffill().drop_duplicates()
   name  age              email  school
0  john   15  [email protected]  middle
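One caveat with Method 1: bfill and ffill on the whole frame fill from whatever row happens to be adjacent, so with more than one person in the data a value can leak from one person to another. A minimal sketch of a per-group variant follows. The sample data, the example.com addresses, and the names keys, value_cols, filled and deduplicated are made up for illustration and are not from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: two people, each split across two partial rows.
df = pd.DataFrame({
    "name": ["john", "john", "mary", "mary"],
    "age": [15, 15, 16, 16],
    "email": [np.nan, "john@example.com", "mary@example.com", np.nan],
    "school": ["middle", np.nan, np.nan, "high"],
})

keys = ["name", "age"]            # columns that identify a person
value_cols = ["email", "school"]  # columns that may contain gaps

# Fill gaps only from rows that share the same name and age.
filled = df.copy()
filled[value_cols] = (
    df.groupby(keys)[value_cols].transform(lambda s: s.bfill().ffill())
)

# After filling, the rows of each person are identical, so one copy remains.
deduplicated = filled.drop_duplicates().reset_index(drop=True)
print(deduplicated)

# For comparison: the frame-wide fill copies mary's school into a john row.
leaky = df.bfill().ffill().drop_duplicates()
print(leaky)
```

On this sample the per-group version leaves one row per person, while the frame-wide fill leaves an extra john row carrying mary's school.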
Method 2: using GroupBy.first:
df = df.groupby(['name', 'age']).first().reset_index()
   name  age              email  school
0  john   15  [email protected]  middle
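As for why the loop in the question has no effect: with mixed column types, iterrows() yields each row as a new Series, so assigning into row does not write back to the frame (and duplicated is itself a separate frame from df). If both rows A and B should be kept with the merged values, rather than collapsed into one, transform('first') may be closer to what was asked. The sketch below uses made-up sample data and an example.com address; keys, value_cols and merged are illustrative names.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data mirroring rows A and B from the question.
df = pd.DataFrame({
    "name": ["john", "john"],
    "age": [15, 15],
    "email": [np.nan, "john@example.com"],
    "school": ["middle", np.nan],
})

# Writing to the Series yielded by iterrows() changes a copy, not the frame.
for _, row in df.iterrows():
    if pd.isna(row["email"]):
        row["email"] = "changed@example.com"
still_missing = int(df["email"].isna().sum())
print(still_missing)  # the gap in df is still there

# transform("first") broadcasts each group's first non-null value
# back onto every row of that group, so the row count is unchanged.
keys = ["name", "age"]
value_cols = ["email", "school"]
merged = df.copy()
merged[value_cols] = df.groupby(keys)[value_cols].transform("first")
print(merged)
```

Calling drop_duplicates() on merged afterwards would give the same single row per person as Method 2.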
Upvotes: 2