Reputation: 35
I have a dataframe (lets call it goodsdf) which consists of 240K rows and a bunch of columns:
id | name | description | ... | availability | price |
---|---|---|---|---|---|
1001 | Item A | Frying pan | ... | 3 | 20.1 |
2031 | Item B | Firewood | ... | 0 | 5 |
3412 | Item C | Olive oil | ... | 10 | 12.5 |
Now in the next step, I'm constantly reading a stream of updated items. Those updates include among others new prices for items, which are pulled every 90 seconds. The stream I'm receiving includes also some additional 100K items which are not of interest for my store.
What I'm looking into doing is to update the dataframe with new prices. To do so I use the following (partially pseudo) code:
for entity in feed.entity:
if entity.HasField('product_update'):
if entity.product_update.id == goodsdf['id']: #pseudo
if goodsdf['availability'] != 0: #pseudo
set goodsdf['price'] == entity.product_update.price #pseudo
From what I have read, there are several different ways for accessing values in dataframes, e.g. by using isin()
, str.contains()
and a couple of others. However, many of them return True
and False
values only. Another way I tried to solve this is by reading new prices and specific item IDs into separate dataframes, which are later merged into my original goodsdf. This in turn showed to create penalties for time and computer resources.
I'm not quite sure I fully understand the concept using nested if statements in combination with updating values in dataframes.
Upvotes: 0
Views: 66
Reputation: 9619
One approach could be to extract the id
and price
first to create a flattened list of dicts, load this as a new dataframe, merge it with the other based on id
, then replace the prices in the original df when they meet the conditions with pd.where()
. I'm not sure if it's efficient enough for your use case, but at least it avoids looping through the data:
import pandas as pd
feed = {'entity1':{'product_update':{'id':1001, 'price':999}}, 'entity2':{'product_update':{'id':2031, 'price':999}}, 'entity3':{'superfluous':'test'}}
extracted_feed_data = [v for val in feed.values() if (v := val.get('product_update'))]
data = [ { "id": 1001, "name": "Item A", "description": "Frying pan", "availability": 3, "price": 20.1 }, { "id": 2031, "name": "Item B", "description": "Firewood", "availability": 0, "price": 5 }, { "id": 3412, "name": "Item C", "description": "Olive oil", "availability": 10, "price": 12.5 } ]
df_update = pd.DataFrame(extracted_feed_data)
df = pd.DataFrame(data)
merged = df.merge(df_update, on='id', how='left')
df['price'] = merged['price_y'].where((df['availability'] != 0) & (merged['price_y'].notnull()), df['price'])
Output df
:
| | id | name | description | availability | price |
|---:|-----:|:-------|:--------------|---------------:|--------:|
| 0 | 1001 | Item A | Frying pan | 3 | 999 |
| 1 | 2031 | Item B | Firewood | 0 | 5 |
| 2 | 3412 | Item C | Olive oil | 10 | 12.5 |
Upvotes: 1