Fhtsm
Fhtsm

Reputation: 147

Filtering for and replacing values in one Pandas DataFrame based on common columns of another DataFrame

I have a question regarding Pandas and the correct indexing and replacing of values.

I have 2 DataFrames, df1 and df2, with the same columns (Col1, Col2, Col3 and Col4).

df1 = pd.DataFrame([['A','b','x',1], ['A','b','y',2], ['A','c','z',3], ['B','b','x',4]], columns=['Col1', 'Col2', 'Col3', 'Col4'])
df2 = pd.DataFrame([['A','b','y',0], ['B','b','x',0]], columns=['Col1','Col2','Col3','Col4'])
    
df1
  Col1 Col2 Col3 Col4
0    A    b    x    1
1    A    b    y    2
2    A    c    z    3
3    B    b    x    4
    
df2
  Col1 Col2 Col3 Col4
0    A    b    y    0
1    B    b    x    0

In df1, I would like to replace the values in Col4 in the rows that match the values of the other columns (Col1, Col2 and Col3) in df2 with another value (let's say 100).

The resulting df1 would look like this:

df1
  Col1 Col2 Col3 Col4
0    A    b    x    1
1    A    b    y  100
2    A    c    z    3
3    B    b    x  100

I have tried with something like this:

columns = list(df1.columns)    
columns.remove('Col4')
df1.loc[(df1[cols] == df2[cols].values).all(axis=1)]['Col4']=100

But I am getting errors and I am not sure if this is even achieving what I want.

Upvotes: 3

Views: 545

Answers (4)

padu
padu

Reputation: 899

You can try this:

new_df =df1.merge(df2, on=['Col1', 'Col2', 'Col3'], how='left', suffixes=(None, '_y'))
new_df.loc[new_df.Col4_y.notna(), 'Col4'] = 100

new_df.loc[:, df1.columns]


   Col1 Col2    Col3    Col4
0   A    b        x       1
1   A    b        y      100
2   A    c        z       3
3   B    b        x      100

I would also like to note the performance of the approach with merge and with isin

%%timeit
# best answer
cols = ['Col1', 'Col2', 'Col3']

temp1 = df1.set_index(cols)

temp2 = df2.set_index(cols)

# get the booleans here 
booleans = temp1.index.isin(temp2.index)


# this assigns 100 to only rows in Col4
# that are True
df1.loc[booleans, 'Col4'] = 100

3.97 ms ± 765 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
cols = ['Col1', 'Col2', 'Col3']
(df1.merge(df2, 
           on = cols, 
           how = 'left', 
           indicator=True, 
           suffixes = (None, '_y'))
    .assign(Col4 = lambda df: np.where(df._merge == 'both', 
                                       100, 
                                       df.Col4))
   .loc[:, df1.columns]
)
5.78 ms ± 660 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%time
# by kelvt
df1 = df1.set_index(['Col1', 'Col2', 'Col3'])
_df2 = df2.set_index(['Col1', 'Col2', 'Col3'])

# set arbitrary value for Col4
_df2['Col4'] = 100

# update values in df1
df1.update(_df2)

# reset index
df1 = df1.reset_index()
21.46 ms ± 609 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

and

%%timeit
# my answer
new_df =df1.merge(df2, on=['Col1', 'Col2', 'Col3'], how='left', suffixes=(None, '_y'))
new_df.loc[new_df.Col4_y.notna(), 'Col4'] = 100
1.49 ms ± 109 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

This is important if your df1 and df2 are large enough

Upvotes: 1

sammywemmy
sammywemmy

Reputation: 28729

You could do an isin with the indices, and assign the value via boolean masking:


cols = ['Col1', 'Col2', 'Col3']

temp1 = df1.set_index(cols)

temp2 = df2.set_index(cols)

# get the booleans here 
booleans = temp1.index.isin(temp2.index)


# this assigns 100 to only rows in Col4
# that are True
df1.loc[booleans, 'Col4'] = 100

df1

  Col1 Col2 Col3  Col4
0    A    b    x     1
1    A    b    y   100
2    A    c    z     3
3    B    b    x   100

Alternatively, you could resolve it with pd.merge and the indicator parameter:

(df1.merge(df2, 
           on = cols, 
           how = 'left', 
           indicator=True, 
           suffixes = (None, '_y'))
    .assign(Col4 = lambda df: np.where(df._merge == 'both', 
                                       100, 
                                       df.Col4))
   .loc[:, df1.columns]
)

  Col1 Col2 Col3  Col4
0    A    b    x     1
1    A    b    y   100
2    A    c    z     3
3    B    b    x   100

Upvotes: 4

kelvt
kelvt

Reputation: 1038

Let's use the pandas.DataFrame.update method:

df1 = df1.set_index(['Col1', 'Col2', 'Col3'])
_df2 = df2.set_index(['Col1', 'Col2', 'Col3'])

# set arbitrary value for Col4
_df2['Col4'] = 100

# update values in df1
df1.update(_df2)

# reset index
df1 = df1.reset_index()

Upvotes: 2

Patrick
Patrick

Reputation: 1191

You can always Loop through the data frames and replace the value when you get a match

for index_df1, row_df1 in df1.iterrows():
    for index_df2, row_df2 in df2.iterrows():
        if (row_df2[:-1] == row_df1[:-1]).all():
            df1.loc[index_df1, 'Col4'] = 100

Upvotes: 0

Related Questions