Dolarious

Reputation: 81

Merge dataframes and remove duplicate columns

I have been working on this all day but couldn't come up with a good solution. I have a list of dataframes

ls = [df1, df2, df3, df4] and have tried several ways to merge them, for example:

df= reduce(lambda left,right: pd.merge(left,right,on='Time',how='outer'), ls)

Most of the examples I have read just drop the col_y columns and rename the col_x columns. How can I get the desired dataframe shown below? Thanks in advance.

df1

Time col1 col2
 1  a   11
 2  b   12
 3  c   13
 4  d   14

df2

Time col3 col1
1   11    na
2   15    na
3   66    na
4   78    na
5   33    f
6   22    g

Merged df (result of merging df1 and df2)

 Time   col1_x  col2    col3    col1_y
 1      a      11     11           na
 2      b      12     15           na
 3      c      13     66           na
 4      d      14     78           na
 5     na      na     33            f
 6     na      na     22            g

What I wanted the final df to be:

Desired merged df

 Time   col1    col2    col3    
 1      a      11     11           
 2      b      12     15           
 3      c      13     66           
 4      d      14     78           
 5      f      na     33            
 6      g      na     22            

Upvotes: 3

Views: 6182

Answers (2)

BallpointBen

Reputation: 13934

You can take advantage of the fact that when dataframes share an index, assignments and replacements align on that index before operating.

import numpy as np
import pandas as pd

df1 = pd.DataFrame.from_records(
    [
        {"Time": 1, "col1": "a", "col2": 11},
        {"Time": 2, "col1": "b", "col2": 12},
        {"Time": 3, "col1": "c", "col2": 13},
        {"Time": 4, "col1": "d", "col2": 14},
    ]
).set_index("Time")

df2 = pd.DataFrame.from_records(
    [
        {"Time": 1, "col3": 11, "col1": np.nan},
        {"Time": 2, "col3": 15, "col1": np.nan},
        {"Time": 3, "col3": 66, "col1": np.nan},
        {"Time": 4, "col3": 78, "col1": np.nan},
        {"Time": 5, "col3": 33, "col1": "f"},
        {"Time": 6, "col3": 22, "col1": "g"},
    ]
).set_index("Time")

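dfs = [df1, df2]

# Build the union of all Time values so that no row is lost
index = pd.Index(dfs[0].index)
for this_df in dfs[1:]:
    index = index.union(this_df.index)

# Start from an empty frame that only carries the combined index
df = pd.DataFrame(index=index)

for this_df in dfs:
    for col in this_df.columns:
        if col not in df.columns:
            # First time this column appears: copy it (aligned on the index)
            df[col] = this_df[col]
        else:
            # Column already present: only fill the gaps, keep existing values
            df[col] = df[col].fillna(this_df[col])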

print(df)

     col1  col2  col3
Time                 
1       a  11.0    11
2       b  12.0    15
3       c  13.0    66
4       d  14.0    78
5       f   NaN    33
6       g   NaN    22
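
The index alignment this answer relies on can be seen in isolation. The sketch below uses made-up Series (the names `left` and `right` are illustrative, not from the question) and shows that `fillna` with a Series matches values by index label, not by position:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: `left` has gaps at Time 5 and 6
left = pd.Series(["a", "b", np.nan, np.nan], index=[1, 2, 5, 6], name="col1")
# `right` only covers Time 6 and 5, deliberately in a different order
right = pd.Series(["g", "f"], index=[6, 5], name="col1")

# Each gap is filled from the entry in `right` with the same index label
filled = left.fillna(right)
print(filled.tolist())  # ['a', 'b', 'f', 'g']
```

Because the match is by label, the row order of the individual dataframes does not matter once Time is the index.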

Upvotes: 2

Space Impact

Reputation: 13255

Use combine_first and then drop the suffixed columns. If you have more than two dataframes, do this after merging and chain further combine_first calls for df3 and df4:

df = df1.merge(df2, how='outer', on=['Time'])
df['col1'] = df['col1_x'].combine_first(df['col1_y'])
df = df.drop(['col1_x', 'col1_y'], axis=1)
# Alternative to the three lines above: index both frames by 'Time' and
# combine them directly (note that 'Time' then becomes the index):
# df = df1.set_index('Time').combine_first(df2.set_index('Time'))

print(df)
   Time  col2  col3 col1
0     1  11.0    11    a
1     2  12.0    15    b
2     3  13.0    66    c
3     4  14.0    78    d
4     5   NaN    33    f
5     6   NaN    22    g
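
If several columns overlap, the merge-then-coalesce step above could be wrapped in a small helper. This is only a sketch: `coalesce_suffixed` is a hypothetical name, and it assumes the default `_x`/`_y` suffixes from `merge` and that no original column name already ends in `_x` with a matching `_y` partner.

```python
import numpy as np
import pandas as pd


def coalesce_suffixed(df, suffixes=("_x", "_y")):
    """Hypothetical helper: fold every <name>_x / <name>_y pair into <name>."""
    left_sfx, right_sfx = suffixes
    out = df.copy()
    for col in df.columns:
        if not col.endswith(left_sfx):
            continue
        base = col[: -len(left_sfx)]
        partner = base + right_sfx
        if partner in df.columns:
            # Prefer the left value, fall back to the right one where it is missing
            out[base] = out[col].combine_first(out[partner])
            out = out.drop(columns=[col, partner])
    return out


# Sample frames rebuilt from the question
df1 = pd.DataFrame({"Time": [1, 2, 3, 4],
                    "col1": ["a", "b", "c", "d"],
                    "col2": [11, 12, 13, 14]})
df2 = pd.DataFrame({"Time": [1, 2, 3, 4, 5, 6],
                    "col3": [11, 15, 66, 78, 33, 22],
                    "col1": [np.nan] * 4 + ["f", "g"]})

merged = coalesce_suffixed(df1.merge(df2, how="outer", on="Time"))
print(merged)
```

When a value is present on both sides, the left one wins, which matches the behaviour of combine_first.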

You can also call combine_first on the dataframes themselves after setting Time as the index:

df1 = df1.set_index('Time')
df2 = df2.set_index('Time')
df3 = df3.set_index('Time')
df4 = df4.set_index('Time')
df = df1.combine_first(df2).combine_first(df3).combine_first(df4)

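Using functools.reduce (set the index on every frame first, because after the first combine_first the result no longer has a Time column to index by):

from functools import reduce

df = reduce(lambda left, right: left.combine_first(right),
            [d.set_index('Time') for d in ls])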
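
For reference, here is a self-contained sketch of the reduce approach, run on the two sample frames from the question. The frames are rebuilt here, every frame is indexed by Time before combining, and `reset_index` at the end is an optional extra step that turns Time back into a column:

```python
from functools import reduce

import numpy as np
import pandas as pd

# Sample frames rebuilt from the question
df1 = pd.DataFrame({"Time": [1, 2, 3, 4],
                    "col1": ["a", "b", "c", "d"],
                    "col2": [11, 12, 13, 14]})
df2 = pd.DataFrame({"Time": [1, 2, 3, 4, 5, 6],
                    "col3": [11, 15, 66, 78, 33, 22],
                    "col1": [np.nan] * 4 + ["f", "g"]})
ls = [df1, df2]

# Index every frame by Time once, then fold them together pairwise
indexed = [d.set_index("Time") for d in ls]
result = reduce(lambda left, right: left.combine_first(right), indexed)

# Optional: restore Time as an ordinary column
result = result.reset_index()
print(result)
```

Earlier frames in the list take priority: a later frame only fills values that are still missing.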

Upvotes: 4
