Reputation: 763
I'm quite new to pandas dataframes, and I'm experiencing some troubles joining two tables.
The first df has just 3 columns:
DF1
:
item_id position document_id
336 1 10
337 2 10
338 3 10
1001 1 11
1002 2 11
1003 3 11
38 10 146
And the second has exactly same two columns (and plenty of others):
DF2
:
item_id document_id col1 col2 col3 ...
337 10 ... ... ...
1002 11 ... ... ...
1003 11 ... ... ...
What I need is to perform an operation which, in SQL, would look as follows:
DF1 join DF2 on
DF1.document_id = DF2.document_id
and
DF1.item_id = DF2.item_id
And, as a result, I want to see DF2, complemented with column 'position':
item_id document_id position col1 col2 col3 ...
What is a good way to do this using pandas?
Upvotes: 30
Views: 65826
Reputation: 23041
If you're merging on all common columns as in the OP, you don't even need to pass on=
, simply calling merge()
will do the job.
merged_df = df1.merge(df2)
The reason is that under the hood, if on=
is not passed, pd.Index.intersection
is called on the columns to determine the common columns and merge on all of them.
A special thing about merging on common columns is that it doesn't matter which dataframe is on the right or the left, the rows filtered are the same because they are selected by looking up matching rows on the common columns. The only difference is where the columns are positioned; the columns in the right dataframe that are not in the left dataframe will be added to the right of the columns on the left dataframe. So unless the order of the columns matter (which can be very easily fixed using column selection or reindex()
), it doesn't really matter which dataframe is on the right and which is on the left. In other words,
df12 = df1.merge(df2, on=['document_id','item_id']).sort_index(axis=1)
df21 = df2.merge(df1, on=['document_id','item_id']).sort_index(axis=1)
# df12 and df21 are the same.
df12.equals(df21) # True
This is not true if the columns to be merged on don't have the same name and you have to pass left_on=
and right_on=
(see #1 in this answer).
Upvotes: 0
Reputation: 862481
I think you need merge
with default inner
join, but is necessary no duplicated combinations of values in both columns:
print (df2)
item_id document_id col1 col2 col3
0 337 10 s 4 7
1 1002 11 d 5 8
2 1003 11 f 7 0
df = pd.merge(df1, df2, on=['document_id','item_id'])
print (df)
item_id position document_id col1 col2 col3
0 337 2 10 s 4 7
1 1002 2 11 d 5 8
2 1003 3 11 f 7 0
But if necessary position
column in position 3
:
df = pd.merge(df2, df1, on=['document_id','item_id'])
cols = df.columns.tolist()
df = df[cols[:2] + cols[-1:] + cols[2:-1]]
print (df)
item_id document_id position col1 col2 col3
0 337 10 2 s 4 7
1 1002 11 2 d 5 8
2 1003 11 3 f 7 0
Upvotes: 54