Reputation: 251
At the moment I am working on a project and ran into an efficiency issue. At some point I have a huge dataframe and need to drop rows. My dataframe contains two columns where the value is either '0', '1' or np.nan.
I need to drop all rows where the first column is '0' and the other is '1', or vice versa. I want to keep all other combinations.
I wrote a function that does the job but it is highly inefficient and takes ages. Here is an example of an input, my function and the expected output.
data = {'val': ['a', 'b', 'c'],
        'compare1': ['0', '0', '1'],
        'compare2': ['0', None, '0']}
df = pd.DataFrame(data)
df
> val compare1 compare2
> 0 a 0 0
> 1 b 0 None
> 2 c 1 0
my function:
def filter_df(df, colname1, colname2):
    for index, row in df.iterrows():
        if (row[colname1] == "0") and (row[colname2] == "1"):
            df.drop(index, inplace=True)
        if (row[colname1] == "1") and (row[colname2] == "0"):
            df.drop(index, inplace=True)
    return df
the output I want to have and get with the function:
df_new = filter_df(df,'compare1', 'compare2')
df_new
> val compare1 compare2
> 0 a 0 0
> 1 b 0 None
If you have an idea to make it more efficient, please let me know :-). I am very happy for any kind of help.
Best P
Upvotes: 1
Views: 43
Reputation: 28644
You could convert the dataframe to numeric types:
df = df.transform(pd.to_numeric, errors="ignore")
df
val compare1 compare2
0 a 0 0.0
1 b 0 NaN
2 c 1 0.0
Then filter where the sum of the compare columns is equal to 1:
df.loc[~df.compare1.add(df.compare2).eq(1)]
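Note that in recent pandas releases `errors="ignore"` for `pd.to_numeric` is deprecated; a sketch that converts just the two compare columns (names taken from the question) and applies the same sum filter:

```python
import pandas as pd

df = pd.DataFrame({'val': ['a', 'b', 'c'],
                   'compare1': ['0', '0', '1'],
                   'compare2': ['0', None, '0']})

# Convert only the compare columns; 'val' stays as strings.
cols = ['compare1', 'compare2']
df[cols] = df[cols].apply(pd.to_numeric)

# A NaN in either column makes the sum NaN, so .eq(1) is False
# there and those rows are kept, as the question requires.
df_new = df.loc[~df['compare1'].add(df['compare2']).eq(1)]
```

This keeps the ('0', '0') and ('0', None) rows and drops the ('1', '0') row, matching the expected output above.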
val compare1 compare2
0 a 0 0.0
1 b 0 NaN
Upvotes: 1
Reputation: 5745
You can use boolean masks like this:
mask1 = ~((df['compare1'] == "0") & (df['compare2'] == "1"))
mask2 = ~((df['compare1'] == "1") & (df['compare2'] == "0"))
df = df[mask1 & mask2]
If the performance is still not sufficient for you, I would suggest moving to numpy for better performance...
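A minimal sketch of that numpy variant, assuming the column names from the question: comparing raw numpy arrays avoids the per-element Series overhead.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'val': ['a', 'b', 'c'],
                   'compare1': ['0', '0', '1'],
                   'compare2': ['0', None, '0']})

# Pull the columns out as plain numpy arrays once,
# then build the drop mask with vectorized comparisons.
c1 = df['compare1'].to_numpy()
c2 = df['compare2'].to_numpy()
drop = ((c1 == '0') & (c2 == '1')) | ((c1 == '1') & (c2 == '0'))
df_new = df[~drop]
```

The logic is identical to the two pandas masks above, just evaluated on the underlying arrays.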
Upvotes: 1