Reputation: 1231
Question:
I am getting data from multiple sources looking for the same data points. I would like to update my main dataframe to reflect availability found in these supplemental files. Each supplemental file is specific to one retailer that is found in the main dataframe.
Data:
data = {
"sku": {
"0": "123",
"1": "321",
"2": "456",
"3": "678",
"4": "123",
"5": "321",
"6": "456",
"7": "101",
"8": "123",
"9": "101"
},
"retailer": {
"0": "BobStore",
"1": "BobStore",
"2": "BobStore",
"3": "BobStore",
"4": "SamStore",
"5": "SamStore",
"6": "RobStore",
"7": "SamStore",
"8": "RobStore",
"9": "BobStore"
},
"description": {
"0": "Jacket",
"1": "Jacket",
"2": "Jacket",
"3": "Jacket",
"4": "Jacket",
"5": "Jacket",
"6": "Jacket",
"7": "Jacket",
"8": "Jacket",
"9": "Jacket"
},
"price": {
"0": 19.99,
"1": 18.99,
"2": 12.99,
"3": 15.99,
"4": 12.99,
"5": 12.99,
"6": 11.99,
"7": 19.99,
"8": 16.99,
"9": 18.99
},
"shipping": {
"0": 6.99,
"1": 4.99,
"2": 6.99,
"3": 3.99,
"4": 6.99,
"5": 4.99,
"6": 6.99,
"7": 9.99,
"8": 1.99,
"9": 2.99,
},
"availability": {
"0": "True",
"1": "False",
"2": "",
"3": "",
"4": "False",
"5": "True",
"6": "",
"7": "",
"8": "",
"9": "True",
}
}
df = pd.DataFrame(data=data)
data = {
"sku": {
"0": "123",
"1": "101",
"2": "456",
"3": "879",
},
"availability": {
"0": "False",
"1": "True",
"2": "True",
"3": "True",
}
}
bobStore = pd.DataFrame(data=data)
data = {
"sku": {
"0": "123",
"1": "101",
},
"availability": {
"0": "False",
"1": "True",
}
}
samStore = pd.DataFrame(data=data)
Attempted Solution:
df.loc[df['retailer']=='BobStore'].set_index('sku').update(bobStore.set_index('sku'))
df.loc[df['retailer']=='BobStore'].set_index('sku').update(bobStore.set_index('sku'))
Desired Output:
data = {"availability":{"0":"False","1":"False","2":"True","3":"","4":"False","5":"True","6":"","7":"True","8":"","9":"True"},"description":{"0":"Jacket","1":"Jacket","2":"Jacket","3":"Jacket","4":"Jacket","5":"Jacket","6":"Jacket","7":"Jacket","8":"Jacket","9":"Jacket"},"price":{"0":19.99,"1":18.99,"2":12.99,"3":15.99,"4":12.99,"5":12.99,"6":11.99,"7":19.99,"8":16.99,"9":18.99},"retailer":{"0":"BobStore","1":"BobStore","2":"BobStore","3":"BobStore","4":"SamStore","5":"SamStore","6":"RobStore","7":"SamStore","8":"RobStore","9":"BobStore"},"shipping":{"0":6.99,"1":4.99,"2":6.99,"3":3.99,"4":6.99,"5":4.99,"6":6.99,"7":9.99,"8":1.99,"9":2.99},"sku":{"0":"123","1":"321","2":"456","3":"678","4":"123","5":"321","6":"456","7":"101","8":"123","9":"101"}}
do = pd.DataFrame(data=data)
Bonus Points:
Help me understand why update doesn't work like I expected? I was able to toggle values when using a multi-index of sku and retailer, but I wasn't able to toggle the values when using the boolean mask to select the slice I wanted to update.
Multi-Index Solution:
df.set_index(['sku','retailer'], inplace=True)
bobStore['retailer'] = 'BobStore'
bobStore.set_index(['sku','retailer'],inplace=True)
df.update(bobStore)
ColdSpeed Solution Attempt:
test1 = bobStore[bobStore['sku'].isin(np.intersect1d(bobStore['sku'],df['sku']))].combine_first(df[df['retailer']=='BobStore']).combine_first(df)
This leaves out the 'BobStore' sku 321 'availability' of False
Upvotes: 0
Views: 50
Reputation: 402533
Can't confirm without expected output, but I think you might be looking for merge
+ combine_first
.
m1 = df.sku.isin(bobStore.sku) & df.retailer.eq('BobStore')
m2 = df.sku.isin(samStore.sku) & df.retailer.eq('SamStore')
i = df[m1].drop('availability', 1)\
.merge(bobStore, on='sku', how='left').set_index(df[m1].index)
j = df[m2].drop('availability', 1)\
.merge(samStore, on='sku', how='left').set_index(df[m2].index)
print (i.combine_first(j).combine_first(df))
availability description price retailer shipping sku
0 False Jacket 19.99 BobStore 6.99 123
1 False Jacket 18.99 BobStore 4.99 321
2 True Jacket 12.99 BobStore 6.99 456
3 Jacket 15.99 BobStore 3.99 678
4 False Jacket 12.99 SamStore 6.99 123
5 True Jacket 12.99 SamStore 4.99 321
6 Jacket 11.99 RobStore 6.99 456
7 True Jacket 19.99 SamStore 9.99 101
8 Jacket 16.99 RobStore 1.99 123
9 True Jacket 18.99 BobStore 2.99 101
Upvotes: 1
Reputation: 30605
One with pd.concat and maping based on newly created keys i.e
bobStore['col'] = 'BobStore'
samStore['col'] = 'SamStore'
new = pd.concat([bobStore,samStore],0)
x = df[['retailer','sku']].sum(1).map(new.set_index(new[['col','sku']].sum(1))['availability'])
df['availability'] = x.combine_first(df['availability'])
Output :
availability description price retailer shipping sku 0 False Jacket 19.99 BobStore 6.99 123 1 False Jacket 18.99 BobStore 4.99 321 2 True Jacket 12.99 BobStore 6.99 456 3 Jacket 15.99 BobStore 3.99 678 4 False Jacket 12.99 SamStore 6.99 123 5 True Jacket 12.99 SamStore 4.99 321 6 Jacket 11.99 RobStore 6.99 456 7 True Jacket 19.99 SamStore 9.99 101 8 Jacket 16.99 RobStore 1.99 123 9 True Jacket 18.99 BobStore 2.99 101
Upvotes: 1
Reputation: 323276
This is by using Multi-Index :)( I used two methods for index slice)
bobStore=bobStore.loc[bobStore.sku.isin(df.loc[df.retailer=='BobStore','sku'])]
samStore=samStore.loc[samStore.sku.isin(df.loc[df.retailer=='SamStore','sku'])]
df=df.reset_index().sort_values(['retailer','sku']).set_index(['retailer','sku'])
idx = pd.IndexSlice
df.loc[(slice('BobStore'),bobStore.sku.tolist()),'availability']=bobStore.sort_values('sku').availability.values
df.loc[idx['SamStore',samStore.sku.tolist()],'availability']=samStore.sort_values('sku').availability.values
df.reset_index().set_index('index').sort_index()
Out[362]:
retailer sku availability description price shipping
index
0 BobStore 123 False Jacket 19.99 6.99
1 BobStore 321 False Jacket 18.99 4.99
2 BobStore 456 True Jacket 12.99 6.99
3 BobStore 678 Jacket 15.99 3.99
4 SamStore 123 False Jacket 12.99 6.99
5 SamStore 321 True Jacket 12.99 4.99
6 RobStore 456 Jacket 11.99 6.99
7 SamStore 101 True Jacket 19.99 9.99
8 RobStore 123 Jacket 16.99 1.99
9 BobStore 101 True Jacket 18.99 2.99
Upvotes: 1