DataFrame fastest way to update rows without a loop

Question

Creating a scenario:

Assume a dataframe with two columns, where A is the input and B is the result of A[index]*2:

df = pd.DataFrame({'A': [1, 2, 3],
                   'B': [2, 4, 6]})

Let's say I receive a 100k-row dataframe and search it for errors (here the 0 in B is invalid):

df = pd.DataFrame({'A': [1, 2, 3],
                   'B': [2, 0, 6]})

I find the invalid rows with:

invalid_rows = df.loc[df['A']*2 != df['B']]

I have the invalid_rows now, but I am not sure what the fastest way is to overwrite those rows in the original df with the result of A[index]*2.

Iterating over the df with iterrows() is an option, but it gets slow as the df grows. Can I use df.update() for this somehow?

Working solution with a loop:

for row_index, my_series in df.iterrows():
  # Recompute B only for rows where it does not match A * 2
  if my_series['A']*2 != my_series['B']:
    df.loc[row_index, 'B'] = my_series['A']*2

But is there a faster way to do this?

Erfan · Accepted Answer

Using mul, ne and loc:

m = df['A'].mul(2).ne(df['B'])
# same as: m = df['A'] * 2 != df['B']
df.loc[m, 'B'] = df['A'].mul(2)

   A  B
0  1  2
1  2  4
2  3  6

m is a boolean Series that marks the rows where A * 2 != B:

print(m)

0    False
1     True
2    False
dtype: bool
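
For reference, here is a self-contained sketch of the approach above, run on the sample data from the question. The names expected, fixed_loc and fixed_mask are only illustrative and not part of the original answer. The second variant uses Series.mask, which is a swapped-in alternative to the loc assignment rather than the method shown above; it should give the same result here.

```python
import pandas as pd

# Sample data from the question: row 1 has an invalid B (0 instead of 4).
df = pd.DataFrame({'A': [1, 2, 3],
                   'B': [2, 0, 6]})

expected = df['A'].mul(2)   # what B should be for every row (illustrative name)
m = expected.ne(df['B'])    # True where B does not match A * 2

# Variant 1: overwrite only the flagged rows on a copy, using loc.
fixed_loc = df.copy()
fixed_loc.loc[m, 'B'] = expected[m]

# Variant 2 (alternative, not from the answer): Series.mask replaces
# the values where m is True with the corresponding value from expected.
fixed_mask = df.assign(B=df['B'].mask(m, expected))

print(fixed_loc)
print(fixed_mask)
```

Both variants operate on whole columns at once, so no Python-level loop over rows is needed.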
