Swier
Swier

Reputation: 4186

pandas Dataframe, assign value based on selection of other rows

I have a pandas DataFrame in python 3.

In this DataFrame there are rows which have identical values in two columns (this can be whole sections), I'll call this a group. Each row also has a True/False value in a column.

Now for each row I want to know if any of the rows in its group have a False value, if so, I want to assign a False value to every row in that group in another column. I've managed to do this in a for-loop, but it's quite slow:

import pandas as pd
import numpy as np

df = pd.DataFrame({'E': [np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan],
                   'D': [0, 1, 2, 3, 4, 5, 6],
                   'C': [True, True, False, False, True, True, True],
                   'B': ['aa', 'aa', 'aa', 'bb', 'cc', 'dd', 'dd'],
                   'A': [0, 0, 0, 0, 1, 1, 1]})

Which gives:

df:
     A   B      C  D    E
  0  0  aa   True  0  NaN
  1  0  aa   True  1  NaN
  2  0  aa  False  2  NaN
  3  0  bb  False  3  NaN
  4  1  cc   True  4  NaN
  5  1  dd   True  5  NaN
  6  1  dd   True  6  NaN

Now I run the for-loop:

for i in df.index:
    df.ix[i, 'E'] = df[(df['A'] == df.iloc[i]['A']) & (df['B'] == df.iloc[i]['B'])]['C'].all()

which then gives the desired result:

df:
     A   B      C  D      E
  0  0  aa   True  0  False
  1  0  aa   True  1  False
  2  0  aa  False  2  False
  3  0  bb  False  3  False
  4  1  cc   True  4   True
  5  1  dd   True  5   True
  6  1  dd   True  6   True

When running this for my entire DataFrame of ~1 million rows this takes ages. So, looking at using .apply() to avoid a for-loop I've stumbled across the following question: apply a function to a pandas Dataframe whose retuned value is based on other rows

however:

def f(x): return False not in x
df.groupby(['A','B']).C.apply(f)

returns:

A  B 
0  aa    False
   bb     True
1  cc     True
   dd     True

Does anyone know a better way or how to fix the last case?

Upvotes: 0

Views: 369

Answers (1)

cggarvey
cggarvey

Reputation: 585

You could try doing a SQL-style join using pd.merge.

Perform the same groupby that you're doing, but apply min() to it to look for any cases with C == True. Then convert that to a DataFrame, rename the column as "E", and merge it back to df.

df = pd.DataFrame({'D': [0, 1, 2, 3, 4, 5, 6],
               'C': [True, True, False, False, True, True, True],
               'B': ['aa', 'aa', 'aa', 'bb', 'cc', 'dd', 'dd'],
               'A': [0, 0, 0, 0, 1, 1, 1]})

falses = pd.DataFrame(df.groupby(['A', 'B']).C.min() == True)
falses = falses.rename(columns={'C': 'E'})

df = df.merge(falses, left_on=['A', 'B'], right_index=True)

Upvotes: 1

Related Questions