Changes made while iterating over a DataFrame don't save

Question

I have a dataframe that contains partially "duplicate" rows. Say I have a row A = {'name': 'john', 'age': 15, 'email': NaN, 'school': 'middle'} and a row B = {'name': 'john', 'age': 15, 'email': 'john@gmail.com', 'school': NaN}. The resulting rows for both A and B should be {'name': 'john', 'age': 15, 'email': 'john@gmail.com', 'school': 'middle'}.

So far I have tried using iterrows() over the dataframe and changing the values, but the changes don't save. My code:

duplicated = df[df.duplicated(['name', 'age'], keep = False)].sort_values('name')
row_iterator = duplicated.iterrows()

_, last = row_iterator.__next__()
for k, row in row_iterator:
    if row['name'] == last['name']:
        for i in duplicated.columns:
            if row[i] == last[i]:
                continue
            if pd.isna(row[i]):
                row[i] = last[i]
            if pd.isna(last[i]):
                last[i] = row[i]
    last = row

df is the dataframe that holds all the data. I first select only the duplicated rows into duplicated, then iterate through that dataframe and try to make changes as I go. But the changes I make are lost by the end. What am I doing wrong?

Erfan · Accepted Answer

Two ways we can solve your problem:

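Method 1: using bfill, ffill and drop_duplicates (this fills across the whole dataframe, so it is only safe when every row describes the same person, as in your example):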

df = df.bfill().ffill().drop_duplicates()

   name  age           email  school
0  john   15  john@gmail.com  middle
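If the dataframe holds more than one person, the frame-wide fill in Method 1 can copy values from a neighbouring row that belongs to someone else. One possible variant, sketched here with made-up sample data (the second person "mary" and the names keys, value_cols, filled and deduped are illustrative assumptions, not part of the original question), restricts the fill to rows that share the same name and age:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: two people, each split across two partial rows.
df = pd.DataFrame({
    "name": ["john", "john", "mary", "mary"],
    "age": [15, 15, 16, 16],
    "email": [np.nan, "john@gmail.com", "mary@gmail.com", np.nan],
    "school": ["middle", np.nan, np.nan, "high"],
})

keys = ["name", "age"]
value_cols = [c for c in df.columns if c not in keys]

filled = df.copy()
# Back-fill then forward-fill, but only within each (name, age) group,
# so a gap is never filled from another person's row.
filled[value_cols] = df.groupby(keys)[value_cols].transform(
    lambda s: s.bfill().ffill()
)

# After filling, rows of the same person are identical and can be collapsed.
deduped = filled.drop_duplicates().reset_index(drop=True)
print(deduped)
```

With this sample data, a plain df.bfill().ffill() would give john's second row mary's school, which is why the fill is grouped here.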

Method 2: GroupBy.first:

df = df.groupby(['name', 'age']).first().reset_index()

   name  age           email  school
0  john   15  john@gmail.com  middle
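As for why the loop in the question loses its changes: for a dataframe with mixed column types, iterrows() hands back a copy of each row as a Series, so assigning to row[i] never touches the dataframe (and duplicated is itself a separate object from df). The sketch below, built on the two rows from the question, shows that behaviour and then one way to keep both rows while filling them, using GroupBy.transform('first'). The names keys, value_cols and filled are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# The two rows from the question.
df = pd.DataFrame({
    "name": ["john", "john"],
    "age": [15, 15],
    "email": [np.nan, "john@gmail.com"],
    "school": ["middle", np.nan],
})

# Each `row` here is a copy, so this assignment is not written back to df.
for _, row in df.iterrows():
    row["email"] = "changed"

print(df["email"].tolist())  # the original values are still in place

keys = ["name", "age"]
value_cols = ["email", "school"]

filled = df.copy()
# 'first' picks the first non-null value per column within each group;
# transform broadcasts it back to every row of that group.
filled[value_cols] = df.groupby(keys)[value_cols].transform("first")
print(filled)
```

This keeps one output row per input row, which matches the "resulting rows for both A and B" wording in the question; Method 2 is the better fit when a single merged row per person is wanted.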
