Reputation: 147
I have a question regarding Pandas and the correct indexing and replacing of values.
I have 2 DataFrames, df1 and df2, with the same columns (Col1, Col2, Col3 and Col4).
df1 = pd.DataFrame([['A','b','x',1], ['A','b','y',2], ['A','c','z',3], ['B','b','x',4]], columns=['Col1', 'Col2', 'Col3', 'Col4'])
df2 = pd.DataFrame([['A','b','y',0], ['B','b','x',0]], columns=['Col1','Col2','Col3','Col4'])
df1
Col1 Col2 Col3 Col4
0 A b x 1
1 A b y 2
2 A c z 3
3 B b x 4
df2
Col1 Col2 Col3 Col4
0 A b y 0
1 B b x 0
In df1, I would like to replace the values in Col4 in the rows that match the values of the other columns (Col1, Col2 and Col3) in df2 with another value (let's say 100).
The resulting df1 would look like this:
df1
Col1 Col2 Col3 Col4
0 A b x 1
1 A b y 100
2 A c z 3
3 B b x 100
I have tried with something like this:
columns = list(df1.columns)
columns.remove('Col4')
df1.loc[(df1[cols] == df2[cols].values).all(axis=1)]['Col4']=100
But I am getting errors and I am not sure if this is even achieving what I want.
Upvotes: 3
Views: 545
Reputation: 899
You can try this:
new_df =df1.merge(df2, on=['Col1', 'Col2', 'Col3'], how='left', suffixes=(None, '_y'))
new_df.loc[new_df.Col4_y.notna(), 'Col4'] = 100
new_df.loc[:, df1.columns]
Col1 Col2 Col3 Col4
0 A b x 1
1 A b y 100
2 A c z 3
3 B b x 100
I would also like to note the performance of the approach with merge
and with isin
%%timeit
# best answer
cols = ['Col1', 'Col2', 'Col3']
temp1 = df1.set_index(cols)
temp2 = df2.set_index(cols)
# get the booleans here
booleans = temp1.index.isin(temp2.index)
# this assigns 100 to only rows in Col4
# that are True
df1.loc[booleans, 'Col4'] = 100
3.97 ms ± 765 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
cols = ['Col1', 'Col2', 'Col3']
(df1.merge(df2,
on = cols,
how = 'left',
indicator=True,
suffixes = (None, '_y'))
.assign(Col4 = lambda df: np.where(df._merge == 'both',
100,
df.Col4))
.loc[:, df1.columns]
)
5.78 ms ± 660 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%time
# by kelvt
df1 = df1.set_index(['Col1', 'Col2', 'Col3'])
_df2 = df2.set_index(['Col1', 'Col2', 'Col3'])
# set arbitrary value for Col4
_df2['Col4'] = 100
# update values in df1
df1.update(_df2)
# reset index
df1 = df1.reset_index()
21.46 ms ± 609 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
and
%%timeit
# my answer
new_df =df1.merge(df2, on=['Col1', 'Col2', 'Col3'], how='left', suffixes=(None, '_y'))
new_df.loc[new_df.Col4_y.notna(), 'Col4'] = 100
1.49 ms ± 109 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
This is important if your df1 and df2 are large enough
Upvotes: 1
Reputation: 28729
You could do an isin
with the indices, and assign the value via boolean masking:
cols = ['Col1', 'Col2', 'Col3']
temp1 = df1.set_index(cols)
temp2 = df2.set_index(cols)
# get the booleans here
booleans = temp1.index.isin(temp2.index)
# this assigns 100 to only rows in Col4
# that are True
df1.loc[booleans, 'Col4'] = 100
df1
Col1 Col2 Col3 Col4
0 A b x 1
1 A b y 100
2 A c z 3
3 B b x 100
Alternatively, you could resolve it with pd.merge
and the indicator
parameter:
(df1.merge(df2,
on = cols,
how = 'left',
indicator=True,
suffixes = (None, '_y'))
.assign(Col4 = lambda df: np.where(df._merge == 'both',
100,
df.Col4))
.loc[:, df1.columns]
)
Col1 Col2 Col3 Col4
0 A b x 1
1 A b y 100
2 A c z 3
3 B b x 100
Upvotes: 4
Reputation: 1038
Let's use the pandas.DataFrame.update
method:
df1 = df1.set_index(['Col1', 'Col2', 'Col3'])
_df2 = df2.set_index(['Col1', 'Col2', 'Col3'])
# set arbitrary value for Col4
_df2['Col4'] = 100
# update values in df1
df1.update(_df2)
# reset index
df1 = df1.reset_index()
Upvotes: 2
Reputation: 1191
You can always Loop through the data frames and replace the value when you get a match
for index_df1, row_df1 in df1.iterrows():
for index_df2, row_df2 in df2.iterrows():
if (row_df2[:-1] == row_df1[:-1]).all():
df1.loc[index_df1, 'Col4'] = 100
Upvotes: 0