shockwave
shockwave

Reputation: 3272

Python Merge columns from multiple data frames into a single dataframe

I'm merging 5 files. Each file has certain column names which are same across the other files. I used this answer to perform the concat operation. There is one column which is unique in all files i.e ID. When I merge the columns I'm noticing the suffix(_x, _y) being applied to end of the column names(due to merge functionality) which are repeating. How can I merge these columns into a single column (Irrespective of null values).

eg:

Dataframe 1
ID    Name    Age    LAN_NBR
1     ABC     24     G284992

Dataframe 2
ID    Name    Street     City     State    TYPE
2     John    Wacker     Chicago  IL       HUB

Dataframe 3
ID    CLOSE_DATE    TYPE
3     1/1/2021      HUB

Dataframe 4
ID    TYPE    LAN_NBR
1     HUB     G284992

Expected Output

ID    Name    Age    LAN_NBR    Street     City     State    TYPE    CLOSE_DATE
1     ABC     24     G284992                                 HUB
2     John                      Wacker     Chicago  IL       HUB
3                                                            HUB     1/1/2021

Code

obj1=pd.read_excel("file1.xlsx")
obj2=pd.read_excel("file2.xlsx")
obj3=pd.read_excel("file3.xlsx")
obj4=pd.read_excel("file4.xlsx")
obj5=pd.read_excel("file5.xlsx")

obj1_ID=pd.DataFrame(obj1["ID"])
obj2_ID=pd.DataFrame(obj2["ID"])
obj3_ID=pd.DataFrame(obj3["ID"])
obj4_ID=pd.DataFrame(obj4["ID"])
obj5_ID=pd.DataFrame(obj5["ID"])

concat_pd=[obj1_ID,obj2_ID,obj3_ID,obj4_ID,obj5_ID]
obj_final=pd.concat(concat_pd).fillna('')

obj_final.obj_final.drop_duplicates(subset='ID',keep='first')

merge1=pd.merge(left=obj_final, right=obj1, on="ID", how="left")
merge1=pd.merge(left=obj_final, right=obj2, on="ID", how="left")
merge1=pd.merge(left=obj_final, right=obj3, on="ID", how="left")
merge1=pd.merge(left=obj_final, right=obj4, on="ID", how="left")
merge1=pd.merge(left=obj_final, right=obj5, on="ID", how="left")

Upvotes: 2

Views: 162

Answers (2)

Quang Hoang
Quang Hoang

Reputation: 150735

Option 1: concat and drop_duplicates:

(pd.concat( (df1, df2, df3, df4), sort=False)
   .groupby('ID')
   .bfill()
   .drop_duplicates('ID')
   .reset_index(drop=True)
)

Output:

   ID  Name   Age  LAN_NBR  Street     City State TYPE CLOSE_DATE
0   1   ABC  24.0  G284992     NaN      NaN   NaN  HUB        NaN
1   2  John   NaN      NaN  Wacker  Chicago    IL  HUB        NaN
2   3   NaN   NaN      NaN     NaN      NaN   NaN  HUB   1/1/2021

Option 2: use combine_first as @ansev suggested:

ret_df = pd.DataFrame()

for d in (df1, df2, df3, df4):
    ret_df = ret_df.combine_first(d.set_index('ID') )

ret_df = ret_df.reset_index() 

Output:

   ID   Age CLOSE_DATE     City  LAN_NBR  Name State  Street TYPE
0   1  24.0        NaN      NaN  G284992   ABC   NaN     NaN  HUB
1   2   NaN        NaN  Chicago      NaN  John    IL  Wacker  HUB
2   3   NaN   1/1/2021      NaN      NaN   NaN   NaN     NaN  HUB

Upvotes: 2

Ben.T
Ben.T

Reputation: 29635

you can use concat and than groupby with first to keep the first non null value available if any per ID.

print (pd.concat([df1, df2, df3, df4]).groupby('ID').first())
    Name   Age  LAN_NBR  Street     City State TYPE CLOSE_DATE
ID                                                            
1    ABC  24.0  G284992     NaN      NaN   NaN  HUB        NaN
2   John   NaN      NaN  Wacker  Chicago    IL  HUB        NaN
3    NaN   NaN      NaN     NaN      NaN   NaN  HUB   1/1/2021

Upvotes: 2

Related Questions