Yale Newman
Yale Newman

Reputation: 1231

Looking to update slices of a DataFrame based on a condition

Question:

I am getting data from multiple sources looking for the same data points. I would like to update my main dataframe to reflect availability found in these supplemental files. Each supplemental file is specific to one retailer that is found in the main dataframe.

Data:

data = {
"sku": {
"0": "123",
"1": "321",
"2": "456",
"3": "678",
"4": "123",
"5": "321",
"6": "456",
"7": "101",
"8": "123",
"9": "101"
},
"retailer": {
"0": "BobStore",
"1": "BobStore",
"2": "BobStore",
"3": "BobStore",
"4": "SamStore",
"5": "SamStore",
"6": "RobStore",
"7": "SamStore",
"8": "RobStore",
"9": "BobStore"
},
"description": {
"0": "Jacket",
"1": "Jacket",
"2": "Jacket",
"3": "Jacket",
"4": "Jacket",
"5": "Jacket",
"6": "Jacket",
"7": "Jacket",
"8": "Jacket",
"9": "Jacket"
},
"price": {
"0": 19.99,
"1": 18.99,
"2": 12.99,
"3": 15.99,
"4": 12.99,
"5": 12.99,
"6": 11.99,
"7": 19.99,
"8": 16.99,
"9": 18.99
 },
 "shipping": {
"0": 6.99,
"1": 4.99,
"2": 6.99,
"3": 3.99,
"4": 6.99,
"5": 4.99,
"6": 6.99,
"7": 9.99,
"8": 1.99,
"9": 2.99,
},
"availability": {
"0": "True",
"1": "False",
"2": "",
"3": "",
"4": "False",
"5": "True",
"6": "",
"7": "",
"8": "",
"9": "True",
}
}
df = pd.DataFrame(data=data)

data = {
  "sku": {
  "0": "123",
  "1": "101",
  "2": "456",
  "3": "879",
  },
 "availability": {
  "0": "False",
  "1": "True",
  "2": "True",
  "3": "True",
  }
 }
bobStore = pd.DataFrame(data=data)

data = {
    "sku": {
    "0": "123",
    "1": "101",
 },
  "availability": {
  "0": "False",
  "1": "True",
 }
  }
samStore = pd.DataFrame(data=data)

Attempted Solution:

df.loc[df['retailer']=='BobStore'].set_index('sku').update(bobStore.set_index('sku'))

df.loc[df['retailer']=='BobStore'].set_index('sku').update(bobStore.set_index('sku'))

Desired Output:

data = {"availability":{"0":"False","1":"False","2":"True","3":"","4":"False","5":"True","6":"","7":"True","8":"","9":"True"},"description":{"0":"Jacket","1":"Jacket","2":"Jacket","3":"Jacket","4":"Jacket","5":"Jacket","6":"Jacket","7":"Jacket","8":"Jacket","9":"Jacket"},"price":{"0":19.99,"1":18.99,"2":12.99,"3":15.99,"4":12.99,"5":12.99,"6":11.99,"7":19.99,"8":16.99,"9":18.99},"retailer":{"0":"BobStore","1":"BobStore","2":"BobStore","3":"BobStore","4":"SamStore","5":"SamStore","6":"RobStore","7":"SamStore","8":"RobStore","9":"BobStore"},"shipping":{"0":6.99,"1":4.99,"2":6.99,"3":3.99,"4":6.99,"5":4.99,"6":6.99,"7":9.99,"8":1.99,"9":2.99},"sku":{"0":"123","1":"321","2":"456","3":"678","4":"123","5":"321","6":"456","7":"101","8":"123","9":"101"}}
do = pd.DataFrame(data=data)

Bonus Points:

Help me understand why update doesn't work like I expected? I was able to toggle values when using a multi-index of sku and retailer, but I wasn't able to toggle the values when using the boolean mask to select the slice I wanted to update.

Multi-Index Solution:

df.set_index(['sku','retailer'], inplace=True)
bobStore['retailer'] = 'BobStore'
bobStore.set_index(['sku','retailer'],inplace=True)
df.update(bobStore)

ColdSpeed Solution Attempt:

test1 = bobStore[bobStore['sku'].isin(np.intersect1d(bobStore['sku'],df['sku']))].combine_first(df[df['retailer']=='BobStore']).combine_first(df)

This leaves out the 'BobStore' sku 321 'availability' of False

Upvotes: 0

Views: 50

Answers (3)

cs95
cs95

Reputation: 402533

Can't confirm without expected output, but I think you might be looking for merge + combine_first.

m1 = df.sku.isin(bobStore.sku) & df.retailer.eq('BobStore') 
m2 = df.sku.isin(samStore.sku) & df.retailer.eq('SamStore')  

i = df[m1].drop('availability', 1)\
        .merge(bobStore, on='sku', how='left').set_index(df[m1].index)
j = df[m2].drop('availability', 1)\
       .merge(samStore, on='sku', how='left').set_index(df[m2].index)

print (i.combine_first(j).combine_first(df))

  availability description  price  retailer  shipping  sku
0        False      Jacket  19.99  BobStore      6.99  123
1        False      Jacket  18.99  BobStore      4.99  321
2         True      Jacket  12.99  BobStore      6.99  456
3                   Jacket  15.99  BobStore      3.99  678
4        False      Jacket  12.99  SamStore      6.99  123
5         True      Jacket  12.99  SamStore      4.99  321
6                   Jacket  11.99  RobStore      6.99  456
7         True      Jacket  19.99  SamStore      9.99  101
8                   Jacket  16.99  RobStore      1.99  123
9         True      Jacket  18.99  BobStore      2.99  101

Upvotes: 1

Bharath M Shetty
Bharath M Shetty

Reputation: 30605

One with pd.concat and maping based on newly created keys i.e

bobStore['col'] = 'BobStore'
samStore['col'] = 'SamStore'
new = pd.concat([bobStore,samStore],0)

x = df[['retailer','sku']].sum(1).map(new.set_index(new[['col','sku']].sum(1))['availability'])

df['availability'] = x.combine_first(df['availability'])

Output :

 availability description  price  retailer  shipping  sku
0        False      Jacket  19.99  BobStore      6.99  123
1        False      Jacket  18.99  BobStore      4.99  321
2         True      Jacket  12.99  BobStore      6.99  456
3                   Jacket  15.99  BobStore      3.99  678
4        False      Jacket  12.99  SamStore      6.99  123
5         True      Jacket  12.99  SamStore      4.99  321
6                   Jacket  11.99  RobStore      6.99  456
7         True      Jacket  19.99  SamStore      9.99  101
8                   Jacket  16.99  RobStore      1.99  123
9         True      Jacket  18.99  BobStore      2.99  101

Upvotes: 1

BENY
BENY

Reputation: 323276

This is by using Multi-Index :)( I used two methods for index slice)

bobStore=bobStore.loc[bobStore.sku.isin(df.loc[df.retailer=='BobStore','sku'])]
samStore=samStore.loc[samStore.sku.isin(df.loc[df.retailer=='SamStore','sku'])]
df=df.reset_index().sort_values(['retailer','sku']).set_index(['retailer','sku'])
idx = pd.IndexSlice
df.loc[(slice('BobStore'),bobStore.sku.tolist()),'availability']=bobStore.sort_values('sku').availability.values
df.loc[idx['SamStore',samStore.sku.tolist()],'availability']=samStore.sort_values('sku').availability.values
df.reset_index().set_index('index').sort_index()

Out[362]: 
       retailer  sku availability description  price  shipping
index                                                         
0      BobStore  123        False      Jacket  19.99      6.99
1      BobStore  321        False      Jacket  18.99      4.99
2      BobStore  456         True      Jacket  12.99      6.99
3      BobStore  678                   Jacket  15.99      3.99
4      SamStore  123        False      Jacket  12.99      6.99
5      SamStore  321         True      Jacket  12.99      4.99
6      RobStore  456                   Jacket  11.99      6.99
7      SamStore  101         True      Jacket  19.99      9.99
8      RobStore  123                   Jacket  16.99      1.99
9      BobStore  101         True      Jacket  18.99      2.99

Upvotes: 1

Related Questions