Code Monkey
Code Monkey

Reputation: 69

Merge two dataframes with similar column(s)

I would like to know how to merge/concat two dataframes with similar columns and a few missing values while still being able to retain all the information. My dataframes look similar to this:

df1

Item ID       Order ID      Name         Location
21            1             John         IL
22            2             John          LA
24            2             Ron          LA
25            3             Ron          LA
29            5             John         IL

df2

Item ID       Order ID      Name         Location    Type
20            1             John         IL          cable
22            2             Ron          LA          cable
23            2             Ron          LA          Box
26            3             Ron          LA          cable
27            N/A           N/A          IL          Box
29            5             John         IL          Box

What I want my dataframe to look like

Item ID       Order ID      Name         Location     Type
20            1             John         IL           Cable
21            4             John         IL           N/A
22            2             John/Ron     LA           Cable
23            2             Ron          LA           Box
24            2             Ron          LA           N/A
25            3             Ron          LA           N/A
26            3             Ron          LA           Cable
27            N/A           N/A          IL           Box
28            N/A           N/A          N/A          N/A
29            5             John         IL           Box

I have tried to do something similar to this

dataframes = [df1, df2]
merged = reduce(lambda left,right: pd.merge(left,right,on='Item ID', how='outer'), dataframes)

But the sorting of it is wrong or it leaves some information out as well as not filling in the missing values (Item ID: 28).

Upvotes: 1

Views: 1169

Answers (3)

Code Monkey
Code Monkey

Reputation: 69

I found this on another posting and with a little change it did what i was looking for. Ill post the definition version for someone who needs it aswell.

# combine the common columns
def merge_dfs(dfs):
df1 = dfs[0]
df2= dfs[1]

left= df1
right = df2

keyCol = 'Request ID'
commonCols = list(set(left.columns & right.columns))
finalCols = list(set(left.columns | right.columns))
#print('Common = ' + str(commonCols) + ', Final = ' + str(finalCols))

mergeDf = left.merge(right, on=keyCol, how='outer', suffixes=('_left', '_right'))


   # combine the common columns
for col in commonCols:
    if col != keyCol:
        for i, row in mergeDf.iterrows():
            leftVal = str(row[col + '_left']).replace('nan', "").strip()
            rightVal = str(row[col + '_right']).replace('nan', "").strip()
            #print(leftVal + ',' + rightVal)
            if leftVal == rightVal:
                mergeDf.loc[i, col] = leftVal
            else:
                mergeDf.loc[i, col] = leftVal + "~" + rightVal

# only use the finalCols
mergeDf = mergeDf[finalCols]
for df in dfs[2:]:
    df1 = mergeDf
    df2= df

    left= df1
    right = df2

    keyCol ='Request ID'
    commonCols = list(set(left.columns & right.columns))
    finalCols = list(set(left.columns | right.columns))
    #print('Common = ' + str(commonCols) + ', Final = ' + str(finalCols))

    mergeDf = left.merge(right, on=keyCol, how='outer', suffixes=('_left', '_right'))


       # combine the common columns
    for col in commonCols:
        if col != keyCol:
            for i, row in mergeDf.iterrows():
                leftVal = str(row[col + '_left']).replace('nan', "").strip()
                rightVal = str(row[col + '_right']).replace('nan', "").strip()
                #print(leftVal + ',' + rightVal)
                leftValWords = leftVal.split('~')
                #print(leftValWords)
                if rightVal in leftValWords:
                    mergeDf.loc[i, col] = leftVal
                else:
                    mergeDf.loc[i, col] = leftVal + '~' + rightVal

# only use the finalCols
    mergeDf = mergeDf[finalCols]
    mergeDf = mergeDf
return mergeDf

Upvotes: 0

anky
anky

Reputation: 75080

Another way if you want to populate missing values is using reindex and combine_first:

l=pd.concat((df1['Item ID'],df2['Item ID']))
final=(df1.set_index('Item ID').reindex(range(l.min(),l.max()+1))
    .combine_first(df2.set_index('Item ID')).reset_index().reindex(columns=df2.columns))

   Item ID  Order ID  Name Location   Type
0       20       1.0  John       IL  cable
1       21       1.0  John       IL    NaN
2       22       2.0   Ron       LA  cable
3       23       2.0   Ron       LA    Box
4       24       2.0   Ron       LA    NaN
5       25       3.0   Ron       LA    NaN
6       26       3.0   Ron       LA  cable
7       27       NaN   NaN       IL    Box
8       28       NaN   NaN      NaN    NaN
9       29       5.0  John       IL    Box

Upvotes: 1

Kenan
Kenan

Reputation: 14094

This could work

pd.concat([df1, df2]).sort_values('Item_ID').drop_duplicates(['Item_ID'], keep='last')

   Item_ID Location  Name  Order_ID   Type
0       20       IL  John       1.0  cable
0       21       IL  John       1.0    NaN
1       22       LA   Ron       2.0  cable
2       23       LA   Ron       2.0    Box
2       24       LA   Ron       2.0    NaN
3       25       LA   Ron       3.0    NaN
3       26       LA   Ron       3.0  cable
4       27       IL   NaN       NaN    Box
5       29       IL  John       5.0    Box

Upvotes: 1

Related Questions