Outcast
Outcast

Reputation: 5117

Join/merge dataframes and preserve the row-order

I work in python and pandas.

Let's suppose that I have the following two dataframes df_1 and df_2 (INPUT):

# df1
    A   B   C
0   2   8   6
1   5   2   5
2   3   4   9
3   5   1   1

# df2
    A   B   C
0   2   7   NaN
1   5   1   NaN
2   3   3   NaN
3   5   0   NaN

I want to process it to join/merge them to get a new dataframe which looks like that (EXPECTED OUTPUT):

    A   B   C
0   2   7   NaN
1   5   1   1
2   3   3   NaN
3   5   0   NaN

So basically it is a right-merge/join but with preserving the order of the original right dataframe.

However, if I do this:

df_2 = df_1.merge(df_2[['A', 'B']], on=['A', 'B'], how='right')

then I get this:

    A   B   C
0   5   1   1.0
1   2   7   NaN
2   3   3   NaN
3   5   0   NaN

So I get the right rows joined/merged but the output dataframe does not have the same row-order as the original right dataframe.

How can I do the join/merge and preserve the row-order too?

The code to create the original dataframes is the following:

import pandas as pd
import numpy as np

columns = ['A', 'B', 'C']
data_1 = [[2, 5, 3, 5], [8, 2, 4, 1], [6, 5, 9, 1]]
data_1 = np.array(data_1).T
df_1 = pd.DataFrame(data=data_1, columns=columns)

columns = ['A', 'B', 'C']
data_2 = [[2, 5, 3, 5], [7, 1, 3, 0], [np.nan, np.nan, np.nan, np.nan]]
data_2 = np.array(data_2).T
df_2 = pd.DataFrame(data=data_2, columns=columns)

I think that by using either .join() or .update() I could get what I want but to start with I am quite surprised that .merge() does not do this very simple thing too.

Upvotes: 1

Views: 1082

Answers (3)

Outcast
Outcast

Reputation: 5117

One quick way is:

df_2=df_2.set_index(['A','B'])

temp = df_1.set_index(['A','B'])

df_2.update(temp)

df_2.reset_index(inplace=True)

As I discuss above with @jezrael above and if I am not missing something, if you do not need both the columns C from the original dataframes and you need only the column C with the matching values then .update() is the quickest way since you do not have to drop the columns that you do not need.

Upvotes: 0

jezrael
jezrael

Reputation: 862611

I think it is bug.

Possible solution with left join:

df_2 = df_2.merge(df_1, on=['A', 'B'], how='left', suffixes=('_','')).drop('C_', axis=1)
print (df_2)
     A    B    C
0  2.0  7.0  NaN
1  5.0  1.0  1.0
2  3.0  3.0  NaN
3  5.0  0.0  NaN

Upvotes: 2

Alexandre B.
Alexandre B.

Reputation: 5502

You can play with index between the both dataframe

print(df)
#    A  B    C
# 0  5  1  1.0
# 1  2  7  NaN
# 2  3  3  NaN
# 3  5  0  NaN

df = df.set_index('B')
df = df.reindex(index=df_2['B'])
df = df.reset_index()
df = df[['A', 'B', 'C']]

print(df)
#    A    B    C
# 0  2  7.0  NaN
# 1  5  1.0  1.0
# 2  3  3.0  NaN
# 3  5  0.0  NaN

Source

Upvotes: 0

Related Questions