Taewoo.Lim
Taewoo.Lim

Reputation: 223

python pandas dataframe find row containing specific value and return boolean

I want compare two dataframes which is df1 and df2. df1 is a data that updates every hour by it self. df2 is a dataframe that alreay exists. I want to append specific row that is updated.

for example, Here is df1

df1:

fd1

which contains 5 rows of information

and df2 which already existed

df2:

df2

we can tell that eric is added but df2 is not representing that.

I could overwrite df2 with df1 but I shouldn't as there will be remark that will be updated by person later the data is written.

So, I decided to remove each rows of data by finding it from df2 by its id, and remove them with for loop

and after that, there will be only eric's row will remain which will make me possible to just append eric to df2.

So what I tried is this

for index, row in df1.iterrows():
    id = row['id']
    if df2.loc[df1['id'].isin(id)] = True:
        df1[df1.id != id)

and it returns syntax error....

am I on the right track? is it will be the best solution to solve this problem?and how should I change the code to achieve my goal?

Upvotes: 1

Views: 1991

Answers (3)

piRSquared
piRSquared

Reputation: 294258

Let's assume that 'steve' has a remark that we want to preserve in df1 and that 'jack' has a remark we want to save in df2. We can set the indices of each dataframe to ['id', 'name'] and use pd.Series.combine_first

Setup

df1 = pd.DataFrame(dict(
    id=[12, 34, 56, 78, 90, 13],
    name='james steve jack ted eric bob'.split(),
    remark='',
))
df1.at[1, 'remark'] = 'meh'

df2 = pd.DataFrame(dict(
    id=[12, 34, 56, 78, 13],
    name='james steve jack ted bob'.split(),
    remark='',
))
df2.at[2, 'remark'] = 'smart'

Solution

s1 = df1.set_index(['id', 'name']).remark
s2 = df2.set_index(['id', 'name']).remark

s1.mask(s1.eq('')).combine_first(s2.mask(s2.eq(''))).fillna('').reset_index()

   id   name remark
0  12  james       
1  13    bob       
2  34  steve    meh
3  56   jack  smart
4  78    ted       
5  90   eric       

However, supposing it was exactly as OP presented it!

Setup

df1 = pd.DataFrame(dict(
    id=[12, 34, 56, 78, 90, 13],
    name='james steve jack ted eric bob'.split(),
    remark='',
))

df2 = pd.DataFrame(dict(
    id=[12, 34, 56, 78, 13],
    name='james steve jack ted bob'.split(),
    remark='',
))
df2.at[2, 'remark'] = 'smart'

Solution

df2.append(df1).drop_duplicates(['id', 'name']).reset_index(drop=True)

   id   name remark
0  12  james       
1  34  steve       
2  56   jack  smart
3  78    ted       
4  13    bob       
5  90   eric    

Upvotes: 2

BENY
BENY

Reputation: 323226

To fix your code ...

l=[]
for index, row in df1.iterrows():
    id = row['Id']
    if sum(df2['Id'].isin([id]))>0:
        l.append(id)
l
Out[334]: [0, 1, 2, 3, 4] # those are the row you need to remove 

df1.loc[~df1.index.isin(l)]# you remove them by using `~` + .isin 
Out[339]: 
   Id Name
5   5    F
6   6    G

By using pd.concat

pd.concat([df2,df1[~df1.Id.isin(df2.Id)]],axis=0)
Out[337]: 
   Id Name
0   0    A
1   1    B
2   2    C
3   3    D
4   4    E
5   5    F
6   6    G

Data Input

fake = {'Id' : [0,1,2,3,4,5,6],
         'Name' : ['A','B','C','D','E','F','G']}
df1 = pd.DataFrame(fake)

fake = {'Id' : [0,1,2,3,4],
         'Name' : ['A','B','C','D','E']}
df2 = pd.DataFrame(fake)

Upvotes: 2

AetherUnbound
AetherUnbound

Reputation: 1744

Pandas has several functions available that allow merging and joining different DataFrames. One you could use here is merge: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.merge.html

>>>merged = df1.merge(df2, how='left')
    id   name remark
0  234  james       
1  212  steve       
2  153   jack  smart
3  567    ted       
4  432   eric    NaN
5  543    bob   

If you didn't want the inserted value to be NaN, you could always use fillna: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.fillna.html.

Upvotes: 0

Related Questions