Reputation: 43
I have 2 dataframes df1 and df2 which have the same number of rows but in a different ordering. The dataframes have an ID column which contains a unique identifier for each row, both dataframes have this ID column and this is the column for which I would like the order of the rows in df2 to match in df1. The reason I want the order to be the same is that the dataframes will be passed to sklearn train_test_split and I wish to split the dataframes in the exact same order. How can I do this?
Here is the first dataframe:
df1
Paper ID Document Label
math.123.txt df1 content 1 0
math.321.txt df1 content 2 0
math.213.txt df1 content 3 0
df2
Paper ID Document Label
math.123.txt df2 content 1 0
math.213.txt df2 content 2 0
math.321.txt df2 content 3 0
desired order of df2:
df2
Paper ID Document Label
math.123.txt df2 content 1 0
math.321.txt df2 content 3 0
math.213.txt df2 content 2 0
So essentially I just want to reorder the rows in df2 based on the order of rows in df1["Paper ID"]
Upvotes: 3
Views: 3400
Reputation: 260600
You can use reindex
using 'Paper ID' as index, here is a pipeline version:
df2 = (df2.set_index('Paper ID')
.reindex(df1.set_index('Paper ID').index)
.reset_index()
)
Or, if your values are unique, you can try to perform a left join on a df1 with only the key column (untested):
pd.merge(df1[['Paper ID']], df2,
how='left', on='Paper ID',
sort=False
)
Upvotes: 3
Reputation: 5164
Since you want to order the dataframes according to the Paper ID, you should first set them as the index in both dataframes:
df1.set_index('Paper ID', inplace=True)
df2.set_index('Paper ID', inplace=True)
Now you can reindex df2
to match the order of df1
:
df2 = df2.reindex(df1.index)
Finally, reset the indices to restore the default index:
df1.reset_index(inplace=True)
df2.reset_index(inplace=True)
Upvotes: 9