Alexis Eggermont
Alexis Eggermont

Reputation: 8145

Joining pandas DataFrames by Column names

I have two DataFrames with the following column names:

frame_1:
event_id, date, time, county_ID

frame_2:
countyid, state

I would like to get a DataFrame with the following columns by left-joining on county_ID = countyid:

joined_dataframe:
event_id, date, time, county, state

I cannot figure out how to do it if the columns on which I want to join are not the index.

Upvotes: 167

Views: 281393

Answers (3)

cottontail
cottontail

Reputation: 23071

Merging using differently named columns duplicates columns; for example, after the call frame_1.merge(frame_2, how='left', left_on='county_ID', right_on='countyid'), both county_ID and countyid columns are created on the joined frame but they have exactly the same values for every row, so presumably only one of them is needed. To not have this problem from the beginning, rename the merge-key column be the same and merge on that column.

df1 = frame_1.rename(columns={'county_ID':'county'})
df2 = frame_2.rename(columns={'countyid':'county'})
joined_frame = df1.merge(df2, on='county', how='left')

res1

Also, if the second frame has only one new additional column (e.g. state) as in the OP, then you can map that column to frame_1 via the common column.

frame_1['state'] = frame_1['county_ID'].map(frame_2.set_index('countyid')['state'])

If county_ID and countyid were indices (not columns), then a straightforward join call does the job. By default, it performs left join.

joined_frame = frame_1.join(frame_2)

res2

One nice thing about join is that if you want to join multiple dataframes on index, then you can pass a list of dataframes and join efficiently (instead of multiple chained merge calls).

joined_frame = frame_1.join([frame_2, frame_3])

Upvotes: 0

Woody Pride
Woody Pride

Reputation: 13955

You can use the left_on and right_on options of pd.merge as follows:

pd.merge(frame_1, frame_2, left_on='county_ID', right_on='countyid')

Or equivalently with DataFrame.merge:

frame_1.merge(frame_2, left_on='county_ID', right_on='countyid')

I was not sure from the question if you only wanted to merge if the key was in the left hand DataFrame. If that is the case then the following will do that (the above will in effect do a many to many merge)

pd.merge(frame_1, frame_2, how='left', left_on='county_ID', right_on='countyid')

Or

frame_1.merge(frame_2, how='left', left_on='county_ID', right_on='countyid')

Upvotes: 292

behzad.nouri
behzad.nouri

Reputation: 77951

you need to make county_ID as index for the right frame:

frame_2.join ( frame_1.set_index( [ 'county_ID' ], verify_integrity=True ),
               on=[ 'countyid' ], how='left' )

for your information, in pandas left join breaks when the right frame has non unique values on the joining column. see this bug.

so you need to verify integrity before joining by , verify_integrity=True

Upvotes: 6

Related Questions