Reputation: 4186
I have a pandas DataFrame in python 3.
In this DataFrame there are rows which have identical values in two columns (this can be whole sections), I'll call this a group.
Each row also has a True
/False
value in a column.
Now for each row I want to know if any of the rows in its group have a False
value, if so, I want to assign a False
value to every row in that group in another column. I've managed to do this in a for-loop, but it's quite slow:
import pandas as pd
import numpy as np
df = pd.DataFrame({'E': [np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan],
'D': [0, 1, 2, 3, 4, 5, 6],
'C': [True, True, False, False, True, True, True],
'B': ['aa', 'aa', 'aa', 'bb', 'cc', 'dd', 'dd'],
'A': [0, 0, 0, 0, 1, 1, 1]})
Which gives:
df:
A B C D E
0 0 aa True 0 NaN
1 0 aa True 1 NaN
2 0 aa False 2 NaN
3 0 bb False 3 NaN
4 1 cc True 4 NaN
5 1 dd True 5 NaN
6 1 dd True 6 NaN
Now I run the for-loop:
for i in df.index:
df.ix[i, 'E'] = df[(df['A'] == df.iloc[i]['A']) & (df['B'] == df.iloc[i]['B'])]['C'].all()
which then gives the desired result:
df:
A B C D E
0 0 aa True 0 False
1 0 aa True 1 False
2 0 aa False 2 False
3 0 bb False 3 False
4 1 cc True 4 True
5 1 dd True 5 True
6 1 dd True 6 True
When running this for my entire DataFrame of ~1 million rows this takes ages. So, looking at using .apply()
to avoid a for-loop I've stumbled across the following question: apply a function to a pandas Dataframe whose retuned value is based on other rows
however:
def f(x): return False not in x
df.groupby(['A','B']).C.apply(f)
returns:
A B
0 aa False
bb True
1 cc True
dd True
Does anyone know a better way or how to fix the last case?
Upvotes: 0
Views: 369
Reputation: 585
You could try doing a SQL-style join using pd.merge
.
Perform the same groupby that you're doing, but apply min() to it to look for any cases with C == True. Then convert that to a DataFrame, rename the column as "E", and merge it back to df.
df = pd.DataFrame({'D': [0, 1, 2, 3, 4, 5, 6],
'C': [True, True, False, False, True, True, True],
'B': ['aa', 'aa', 'aa', 'bb', 'cc', 'dd', 'dd'],
'A': [0, 0, 0, 0, 1, 1, 1]})
falses = pd.DataFrame(df.groupby(['A', 'B']).C.min() == True)
falses = falses.rename(columns={'C': 'E'})
df = df.merge(falses, left_on=['A', 'B'], right_index=True)
Upvotes: 1