Reputation: 131
I have a huge dataframe, around 450K rows. This dataframe, contains a house ID which then acts as an identifier for a household. For example :
House_ID Resident_name SSN_ID Occupation Offer
9211 Aaron 1122 Unemployed No_offer
9211 Emelda 9831 Unemployed No_offer
9211 Brandt 9744 Prosecutor Household_offer
9080 Elise 8903 Teacher No_offer
9531 Ryan 7856 Unemployed Household_offer
9531 Gillian 9002 Unemployed Household_offer
In the example dataframe, there are 2 households and 1 individual. My aim is to modify/alter column 'Offer' based on column 'Occupation' in a household.
The logics are as below :
The desired output is as below :
House_ID Resident_name SSN_ID Occupation Offer
9211 Aaron 1122 Unemployed Household_offer
9211 Emelda 9831 Unemployed Household_offer
9211 Brandt 9744 Prosecutor Household_offer
9080 Elise 8903 Teacher Household_offer
9531 Ryan 7856 Unemployed No_offer
9531 Gillian 9002 Unemployed No_offer
The reason Aaron and Emelda received a household offer (even when they're both unemployed) is that the other member of their household is still employed, while Ryan's household members receives no offer because none of them are employed.
My current solution is using grouby
and loop, where I group each household using the household ID then go thru each member employment status before assigning the offer, as below :
df = HouseDF.groupby('house_ID')
mergedDF = pd.DataFrame()
for i in range(len(list(df.groups))) :
tempDF = df.get_group(list(df.groups)[i])
employment_list = tempDF['Occupation'].unique().tolist()
n_occupation = len(tempDF['Occupation'].unique().tolist())
if ((n_occupation == 1) & ('Unemployment' in employment_list )):
tempDF['Offer'] = 'No_offer'
mergedDF = pd.concat([mergedDF,tempDF])
While the solution barely works, it takes too long to finish due to the iteration since the original dataset has around 200K house_ID (but 450K SSN_ID), and looping thru those households (and iterate each of its members to check their employment status) will be too time-consuming and inefficient.
Is there any solution that is much more efficient for setting a value based on the condition of multiple rows and columns?
Thank you~
Upvotes: 1
Views: 856
Reputation: 35626
Try groupby transform to check each household with all to check if all members of the household are unemployed, then np.where to assign the 'No_offer', 'Household_offer' values:
import numpy as np
import pandas as pd
HouseDF = pd.DataFrame({
'House_ID': {0: 9211, 1: 9211, 2: 9211, 3: 9080, 4: 9531, 5: 9531},
'Resident_name': {0: 'Aaron', 1: 'Emelda', 2: 'Brandt', 3: 'Elise',
4: 'Ryan', 5: 'Gillian'},
'SSN_ID': {0: 1122, 1: 9831, 2: 9744, 3: 8903, 4: 7856, 5: 9002},
'Occupation': {0: 'Unemployed', 1: 'Unemployed', 2: 'Prosecutor',
3: 'Teacher', 4: 'Unemployed', 5: 'Unemployed'},
'Offer': {0: 'No_offer', 1: 'No_offer', 2: 'Household_offer',
3: 'No_offer', 4: 'Household_offer', 5: 'Household_offer'}
})
all_unemployed = HouseDF.groupby('House_ID')['Occupation'] \
.transform(lambda o: o.eq('Unemployed').all())
HouseDF['Offer'] = np.where(all_unemployed, 'No_offer', 'Household_offer')
print(HouseDF)
HouseDF
:
House_ID Resident_name SSN_ID Occupation Offer
0 9211 Aaron 1122 Unemployed Household_offer
1 9211 Emelda 9831 Unemployed Household_offer
2 9211 Brandt 9744 Prosecutor Household_offer
3 9080 Elise 8903 Teacher Household_offer
4 9531 Ryan 7856 Unemployed No_offer
5 9531 Gillian 9002 Unemployed No_offer
Groupby Transform without np.where option:
HouseDF['Offer'] = HouseDF.groupby('House_ID')['Occupation'].transform(
lambda o: 'No_offer' if o.eq('Unemployed').all() else 'Household_offer'
)
Some timeit information (number=1000):
loop
3.6998247
transform + where
1.6054817000000003
transform
1.5982574999999999
import timeit
setup = '''
import numpy as np
import pandas as pd
HouseDF = pd.DataFrame(
{'House_ID': {0: 9211, 1: 9211, 2: 9211, 3: 9080, 4: 9531, 5: 9531},
'Resident_name': {0: 'Aaron', 1: 'Emelda', 2: 'Brandt', 3: 'Elise',
4: 'Ryan', 5: 'Gillian'},
'SSN_ID': {0: 1122, 1: 9831, 2: 9744, 3: 8903, 4: 7856, 5: 9002},
'Occupation': {0: 'Unemployed', 1: 'Unemployed', 2: 'Prosecutor',
3: 'Teacher', 4: 'Unemployed', 5: 'Unemployed'},
'Offer': {0: 'No_offer', 1: 'No_offer', 2: 'Household_offer',
3: 'No_offer', 4: 'Household_offer', 5: 'Household_offer'}})
'''
loop = '''
df = HouseDF.groupby('House_ID')
mergedDF = pd.DataFrame()
for i in range(len(list(df.groups))) :
tempDF = df.get_group(list(df.groups)[i])
employment_list = tempDF['Occupation'].unique().tolist()
n_occupation = len(tempDF['Occupation'].unique().tolist())
if ((n_occupation == 1) & ('Unemployment' in employment_list )):
tempDF['Offer'] = 'No_offer'
mergedDF = pd.concat([mergedDF,tempDF])
'''
transform1 = '''
all_unemployed = HouseDF.groupby('House_ID')['Occupation'] \
.transform(lambda o: o.eq('Unemployed').all())
HouseDF['Offer'] = np.where(all_unemployed, 'No_offer', 'Household_offer')
'''
transform2 = '''
HouseDF['Offer'] = HouseDF.groupby('House_ID')['Occupation'].transform(
lambda o: 'No_offer' if o.eq('Unemployed').all() else 'Household_offer'
)
'''
if __name__ == '__main__':
print('loop')
print(timeit.timeit(setup=setup, stmt=loop, number=1000))
print('transform + where')
print(timeit.timeit(setup=setup, stmt=transform1, number=1000))
print('transform')
print(timeit.timeit(setup=setup, stmt=transform2, number=1000))
Upvotes: 1