Reputation: 984
I have two dataframes of different sizes. They both have four columns: Words, x, y and z.
However, when joining these two dataframes, I want to keep the values of x, y, z of the words that are similar. The words that doesn't exist in df1 but exist in df2 are kept.
I tried to use pd.merge
but this will keep the two values and only similar words. And if I use pd.concat
I have to drop similar elements, but will not be from the first data frame.
df1 = pd.DataFrame({'Words':
['aardvark', 'abalone', 'abandon'],
'x': [0.999, 0.888, 0.777],
'y': [0.999, 0.888, 0.777],
'z': [0.999, 0.888, 0.777]})
df2 = pd.DataFrame({'Words':
['aaaaahh', 'aardvark', 'abalone', 'abandon', 'zoo', 'zoom', 'zucchini'],
'x': [0.199, 0.111, 0.222, 0.333, 0.232, 0.842, 0.945],
'y': [0.929, 0.111, 0.222, 0.333, 0.112, 0.62, 0.265],
'z': [0.993, 0.111, 0.222, 0.333, 0.212, 0.344, 0.745]})
# Expected output
df_res = pd.DataFrame({'Words':
['aaaaahh', 'aardvark', 'abalone', 'abandon', 'zoo', 'zoom', 'zucchini'],
'x': [0.199, 0.999, 0.888, 0.777, 0.232, 0.842, 0.945],
'y': [0.929, 0.999, 0.888, 0.777, 0.112, 0.62, 0.265],
'z': [0.993, 0.999, 0.888, 0.777, 0.212, 0.344, 0.745]})
import pandas as pd
# Merge
df_res = pd.merge(df1, df2, on='Word', how='inner')
# Concat
df_concat = pd.concat(objs=[df1, df2], ignore_index=True)
df_concat = pd.drop_duplicates(subset=['Word'], keep=False, ignore_index=True)
# Compare
d_res = d1[(d1['Word'] != d1['Word'])]
ValueError: Can only compare identically-labeled Series objects
Upvotes: 1
Views: 55
Reputation: 3720
Maybe less performant than @Sayandip Dutta answer, you could try a right join (or left, depending on the order you put arguments in pd.merge):
In [4]: res = pd.merge(df1, df2, how='right', on='Words', suffixes=("_1", "_2"))
In [5]: res
Out[6]:
Words x_1 y_1 z_1 x_2 y_2 z_2
0 aardvark 0.999 0.999 0.999 0.111 0.111 0.111
1 abalone 0.888 0.888 0.888 0.222 0.222 0.222
2 abandon 0.777 0.777 0.777 0.333 0.333 0.333
3 aaaaahh NaN NaN NaN 0.199 0.929 0.993
4 zoo NaN NaN NaN 0.232 0.112 0.212
5 zoom NaN NaN NaN 0.842 0.620 0.344
6 zucchini NaN NaN NaN 0.945 0.265 0.745
Then you can fillna
of x_1, y_1, z_1 with values of x_2, y_2 and z_2.
In [8]: res.x_1.fillna(res.x_2, inplace=True)
In [8]: res.y_1.fillna(res.y_2, inplace=True)
In [9]: res.z_1.fillna(res.z_2, inplace=True)
In [10]: df_res = res[["Words", "x_1", "y_1", ,"z_1"]]
In [11]: df_res
Out[11]:
Words x_1 y_1 z_1
0 aardvark 0.999 0.999 0.999
1 abalone 0.888 0.888 0.888
2 abandon 0.777 0.777 0.777
3 aaaaahh 0.199 0.929 0.993
4 zoo 0.232 0.112 0.212
5 zoom 0.842 0.620 0.344
6 zucchini 0.945 0.265 0.745
Upvotes: 1
Reputation: 15872
You can use df.append
to append df1
to df2
, followed by drop_duplicates
, with keep='last'
, then sort_index
and reset_index
:
>>> (df2.append(df1)
.drop_duplicates('Words', keep='last')
.sort_index()
.reset_index(drop=True))
Words x y z
0 aaaaahh 0.199 0.929 0.993
1 aardvark 0.999 0.999 0.999
2 abalone 0.888 0.888 0.888
3 abandon 0.777 0.777 0.777
4 zoo 0.232 0.112 0.212
5 zoom 0.842 0.620 0.344
6 zucchini 0.945 0.265 0.745
Upvotes: 5