Xiaowan Wen
Xiaowan Wen

Reputation: 145

How to compare two Dataframe and update a particular column in one of the Dataframe?

Update: I am having some issue with the answers and guessing it's something to do with the rows are not matching. (updated the today_df so that today and yesterday has diff # of rows).

I need to compare yesterday's dataframe and today's new dataframe to update one column on today's dataframe. for example: yesterday_df:

fruit      price  qty   comment
apple       1.5   10    In Stock
banana      0.7   20    In Stock
Pear        3.0   5     In Stock
Watermelon  1.2   30    Out Stock

today_df (updated the today_df so that today and yesterday has diff # of rows)

fruit      price  qty   comment
apple       1.5   10    
Strawberry  1.7   20    
Pineapple   0.9   5     
Watermelon  1.2   30   
Peach       2.1   10    

Desired output: today_df

fruit      price  qty   comment
apple       1.5   10    In Stock  <-- from yesterday
Strawberry  1.7   20               <-- empty
Pineapple   0.9   5                 <-- empty
Watermelon  1.2   30    Out Stock   <-- From Yesterday
peach       2.1   10                <-- empty

For Today's Dataframe, the comment from previous day for Apple and Watermelon were carried over and the new info (Strawberry and Pineapple) from data were left blank.

i've tried to use series mapping:

fruit_map = yesterday_df.set_index('fruit')['Comment']
today_df['Comment']=today_df['fruit'].map(fruit_map).fillna(today_df['Comment'])
today_df

but getting this error:

InvalidIndexError: Reindexing only valid with uniquely valued Index objects

fruit_map = yesterday_df.set_index('fruit')['Comment']
today_df['Comment']=today_df['fruit'].map(fruit_map).fillna(today_df['Comment'])
today_df

today_df
fruit      price  qty   comment
apple       1.5   10    In Stock
Strawberry  1.7   20    
Pineapple   0.9   5     
Watermelon  1.2   30    Out Stock

Upvotes: 2

Views: 4827

Answers (2)

Andy L.
Andy L.

Reputation: 25239

Edit:
Since your yesterday_df has some duplicates in fruit, you commands throw the errors you posted. Any solution using index also errors out. You have 2 ways to solve it.

Method 1: get rid duplicates in yesterday_df and your solution using map as you posted would work without any issue.

Method 2: Change your solution a little bit with groupby yesterday_df on fruit and pick first of each group to use for updating today_df as follows:

fruit_map = y_df.groupby('fruit').first()['comment']
t_df['comment'] = t_df['fruit'].map(fruit_map).fillna(t_df['comment'])

Original: with data as belows:

y_df

Out[2896]:
        fruit  price  qty    comment
0       apple    1.5   10   In Stock
1      banana    0.7   20   In Stock
2        Pear    3.0    5   In Stock
3  Watermelon    1.2   30  Out Stock

t_df
Out[2897]:
        fruit  price  qty comment
0       apple    1.5   10
1  Strawberry    1.7   20
2   Pineapple    0.9    5
3  Watermelon    1.2   30

You just need this command

t_df.comment = y_df.comment.where((y_df.price == t_df.price) & (y_df.fruit == t_df.fruit), '')

Out[2908]:
        fruit  price  qty    comment
0       apple    1.5   10   In Stock
1  Strawberry    1.7   20
2   Pineapple    0.9    5
3  Watermelon    1.2   30  Out Stock

Upvotes: 2

aiguofer
aiguofer

Reputation: 2137

You were on the right path, but you can leverage indexes to assign values so there's no need for map. When you assign values to a dataframe using a Series, it'll align on the index. The key here is to make sure that you're using "fruit" as the index for both dataframes. The easiest way to do this is:

import pandas as pd
from pandas.compat import StringIO

yesterday = """
fruit,price,qty,comment,col to show it works with different shapes
apple,1.5,10,In Stock,bleh
banana,0.7,20,In Stock,blah
Pear,3.0,5,In Stock,bip
Watermelon,1.2,30,Out Stock,bop
"""

today = """
fruit,price,qty,comment
apple,1.5,10,
Strawberry,1.7,20,
Pineapple,0.9,5,
Watermelon,1.2,30,
Peach,2.1,10,
"""

yesterday_df = pd.read_csv(StringIO(yesterday), sep=",").set_index("fruit")
today_df = pd.read_csv(StringIO(today), sep=",").set_index("fruit")

today_df["comment"] = yesterday_df["comment"]

today_df

Out[1]: 
            price  qty    comment
fruit                            
apple         1.5   10   In Stock
Strawberry    1.7   20        NaN
Pineapple     0.9    5        NaN
Watermelon    1.2   30  Out Stock
Peach         2.1   10        NaN

Alternatively, you could set the index to 'fruit' when you read in your data and avoid the set_index call.

You might also want to clean up you index ('fruit' column) by lower casing it since string indexes are case sensitive. For example, if yesterday you had "Apple" and today it was "apple", this wouldn't work.

Also, I'm assuming you're not interested in changing anything about 'price' or 'qty'.

Upvotes: 2

Related Questions