Set value based on condition on multiple rows and columns Pandas

Question

I have a huge dataframe, around 450K rows. This dataframe, contains a house ID which then acts as an identifier for a household. For example :

House_ID       Resident_name     SSN_ID        Occupation      Offer
9211           Aaron             1122          Unemployed      No_offer
9211           Emelda            9831          Unemployed      No_offer
9211           Brandt            9744          Prosecutor      Household_offer
9080           Elise             8903          Teacher         No_offer
9531           Ryan              7856          Unemployed      Household_offer
9531           Gillian           9002          Unemployed      Household_offer

In the example dataframe, there are 2 households and 1 individual. My aim is to modify/alter column 'Offer' based on column 'Occupation' in a household.

The logics are as below :

if ALL members in the household are 'Unemployed' then 'Offer' = 'No_offer'
if ONLY SOME members in the household are 'Unemployed' then 'Offer' = 'Household_offer'

The desired output is as below :

House_ID       Resident_name     SSN_ID        Occupation      Offer
9211           Aaron             1122          Unemployed      Household_offer
9211           Emelda            9831          Unemployed      Household_offer
9211           Brandt            9744          Prosecutor      Household_offer
9080           Elise             8903          Teacher         Household_offer
9531           Ryan              7856          Unemployed      No_offer
9531           Gillian           9002          Unemployed      No_offer

The reason Aaron and Emelda received a household offer (even when they're both unemployed) is that the other member of their household is still employed, while Ryan's household members receives no offer because none of them are employed.

My current solution is using grouby and loop, where I group each household using the household ID then go thru each member employment status before assigning the offer, as below :

df = HouseDF.groupby('house_ID')
mergedDF = pd.DataFrame()
for i in range(len(list(df.groups))) :
    tempDF = df.get_group(list(df.groups)[i])
    employment_list = tempDF['Occupation'].unique().tolist()
    n_occupation = len(tempDF['Occupation'].unique().tolist())
    if ((n_occupation == 1) & ('Unemployment' in employment_list )):
        tempDF['Offer'] = 'No_offer'
    mergedDF = pd.concat([mergedDF,tempDF])

While the solution barely works, it takes too long to finish due to the iteration since the original dataset has around 200K house_ID (but 450K SSN_ID), and looping thru those households (and iterate each of its members to check their employment status) will be too time-consuming and inefficient.

Is there any solution that is much more efficient for setting a value based on the condition of multiple rows and columns?

Thank you~

Henry Ecker · Accepted Answer

Try groupby transform to check each household with all to check if all members of the household are unemployed, then np.where to assign the 'No_offer', 'Household_offer' values:

import numpy as np
import pandas as pd

HouseDF = pd.DataFrame({
    'House_ID': {0: 9211, 1: 9211, 2: 9211, 3: 9080, 4: 9531, 5: 9531},
    'Resident_name': {0: 'Aaron', 1: 'Emelda', 2: 'Brandt', 3: 'Elise',
                      4: 'Ryan', 5: 'Gillian'},
    'SSN_ID': {0: 1122, 1: 9831, 2: 9744, 3: 8903, 4: 7856, 5: 9002},
    'Occupation': {0: 'Unemployed', 1: 'Unemployed', 2: 'Prosecutor',
                   3: 'Teacher', 4: 'Unemployed', 5: 'Unemployed'},
    'Offer': {0: 'No_offer', 1: 'No_offer', 2: 'Household_offer',
              3: 'No_offer', 4: 'Household_offer', 5: 'Household_offer'}
})

all_unemployed = HouseDF.groupby('House_ID')['Occupation'] \
    .transform(lambda o: o.eq('Unemployed').all())

HouseDF['Offer'] = np.where(all_unemployed, 'No_offer', 'Household_offer')

print(HouseDF)

HouseDF:

   House_ID Resident_name  SSN_ID  Occupation            Offer
0      9211         Aaron    1122  Unemployed  Household_offer
1      9211        Emelda    9831  Unemployed  Household_offer
2      9211        Brandt    9744  Prosecutor  Household_offer
3      9080         Elise    8903     Teacher  Household_offer
4      9531          Ryan    7856  Unemployed         No_offer
5      9531       Gillian    9002  Unemployed         No_offer

Groupby Transform without np.where option:

HouseDF['Offer'] = HouseDF.groupby('House_ID')['Occupation'].transform(
    lambda o: 'No_offer' if o.eq('Unemployed').all() else 'Household_offer'
)

Some timeit information (number=1000):

loop
3.6998247
transform + where
1.6054817000000003
transform
1.5982574999999999

import timeit

setup = '''
import numpy as np
import pandas as pd

HouseDF = pd.DataFrame(
    {'House_ID': {0: 9211, 1: 9211, 2: 9211, 3: 9080, 4: 9531, 5: 9531},
     'Resident_name': {0: 'Aaron', 1: 'Emelda', 2: 'Brandt', 3: 'Elise',
                       4: 'Ryan', 5: 'Gillian'},
     'SSN_ID': {0: 1122, 1: 9831, 2: 9744, 3: 8903, 4: 7856, 5: 9002},
     'Occupation': {0: 'Unemployed', 1: 'Unemployed', 2: 'Prosecutor',
                    3: 'Teacher', 4: 'Unemployed', 5: 'Unemployed'},
     'Offer': {0: 'No_offer', 1: 'No_offer', 2: 'Household_offer',
               3: 'No_offer', 4: 'Household_offer', 5: 'Household_offer'}})
'''

loop = '''
df = HouseDF.groupby('House_ID')
mergedDF = pd.DataFrame()
for i in range(len(list(df.groups))) :
    tempDF = df.get_group(list(df.groups)[i])
    employment_list = tempDF['Occupation'].unique().tolist()
    n_occupation = len(tempDF['Occupation'].unique().tolist())
    if ((n_occupation == 1) & ('Unemployment' in employment_list )):
        tempDF['Offer'] = 'No_offer'
    mergedDF = pd.concat([mergedDF,tempDF])
'''

transform1 = '''
all_unemployed = HouseDF.groupby('House_ID')['Occupation'] \
    .transform(lambda o: o.eq('Unemployed').all())

HouseDF['Offer'] = np.where(all_unemployed, 'No_offer', 'Household_offer')
'''

transform2 = '''
HouseDF['Offer'] = HouseDF.groupby('House_ID')['Occupation'].transform(
    lambda o: 'No_offer' if o.eq('Unemployed').all() else 'Household_offer'
)
'''

if __name__ == '__main__':
    print('loop')
    print(timeit.timeit(setup=setup, stmt=loop, number=1000))
    print('transform + where')
    print(timeit.timeit(setup=setup, stmt=transform1, number=1000))
    print('transform')
    print(timeit.timeit(setup=setup, stmt=transform2, number=1000))

Set value based on condition on multiple rows and columns Pandas

Answers (1)

Related Questions