Reputation: 8145
I have two DataFrames with the following column names:
frame_1:
event_id, date, time, county_ID
frame_2:
countyid, state
I would like to get a DataFrame with the following columns by left-joining on county_ID = countyid
:
joined_dataframe:
event_id, date, time, county, state
I cannot figure out how to do it if the columns on which I want to join are not the index.
Upvotes: 167
Views: 281393
Reputation: 23071
Merging using differently named columns duplicates columns; for example, after the call frame_1.merge(frame_2, how='left', left_on='county_ID', right_on='countyid')
, both county_ID
and countyid
columns are created on the joined frame but they have exactly the same values for every row, so presumably only one of them is needed. To not have this problem from the beginning, rename the merge-key column be the same and merge on that column.
df1 = frame_1.rename(columns={'county_ID':'county'})
df2 = frame_2.rename(columns={'countyid':'county'})
joined_frame = df1.merge(df2, on='county', how='left')
Also, if the second frame has only one new additional column (e.g. state
) as in the OP, then you can map that column to frame_1
via the common column.
frame_1['state'] = frame_1['county_ID'].map(frame_2.set_index('countyid')['state'])
If county_ID
and countyid
were indices (not columns), then a straightforward join
call does the job. By default, it performs left join.
joined_frame = frame_1.join(frame_2)
One nice thing about join
is that if you want to join multiple dataframes on index, then you can pass a list of dataframes and join efficiently (instead of multiple chained merge
calls).
joined_frame = frame_1.join([frame_2, frame_3])
Upvotes: 0
Reputation: 13955
You can use the left_on
and right_on
options of pd.merge as follows:
pd.merge(frame_1, frame_2, left_on='county_ID', right_on='countyid')
Or equivalently with DataFrame.merge:
frame_1.merge(frame_2, left_on='county_ID', right_on='countyid')
I was not sure from the question if you only wanted to merge if the key was in the left hand DataFrame. If that is the case then the following will do that (the above will in effect do a many to many merge)
pd.merge(frame_1, frame_2, how='left', left_on='county_ID', right_on='countyid')
Or
frame_1.merge(frame_2, how='left', left_on='county_ID', right_on='countyid')
Upvotes: 292
Reputation: 77951
you need to make county_ID
as index for the right frame:
frame_2.join ( frame_1.set_index( [ 'county_ID' ], verify_integrity=True ),
on=[ 'countyid' ], how='left' )
for your information, in pandas left join breaks when the right frame has non unique values on the joining column. see this bug.
so you need to verify integrity before joining by , verify_integrity=True
Upvotes: 6