Reputation: 3272
I'm merging 5 files. Each file has certain column names which are same across the other files. I used this answer to perform the concat operation. There is one column which is unique in all files i.e ID. When I merge the columns I'm noticing the suffix(_x, _y) being applied to end of the column names(due to merge functionality) which are repeating. How can I merge these columns into a single column (Irrespective of null values).
eg:
Dataframe 1
ID Name Age LAN_NBR
1 ABC 24 G284992
Dataframe 2
ID Name Street City State TYPE
2 John Wacker Chicago IL HUB
Dataframe 3
ID CLOSE_DATE TYPE
3 1/1/2021 HUB
Dataframe 4
ID TYPE LAN_NBR
1 HUB G284992
Expected Output
ID Name Age LAN_NBR Street City State TYPE CLOSE_DATE
1 ABC 24 G284992 HUB
2 John Wacker Chicago IL HUB
3 HUB 1/1/2021
Code
obj1=pd.read_excel("file1.xlsx")
obj2=pd.read_excel("file2.xlsx")
obj3=pd.read_excel("file3.xlsx")
obj4=pd.read_excel("file4.xlsx")
obj5=pd.read_excel("file5.xlsx")
obj1_ID=pd.DataFrame(obj1["ID"])
obj2_ID=pd.DataFrame(obj2["ID"])
obj3_ID=pd.DataFrame(obj3["ID"])
obj4_ID=pd.DataFrame(obj4["ID"])
obj5_ID=pd.DataFrame(obj5["ID"])
concat_pd=[obj1_ID,obj2_ID,obj3_ID,obj4_ID,obj5_ID]
obj_final=pd.concat(concat_pd).fillna('')
obj_final.obj_final.drop_duplicates(subset='ID',keep='first')
merge1=pd.merge(left=obj_final, right=obj1, on="ID", how="left")
merge1=pd.merge(left=obj_final, right=obj2, on="ID", how="left")
merge1=pd.merge(left=obj_final, right=obj3, on="ID", how="left")
merge1=pd.merge(left=obj_final, right=obj4, on="ID", how="left")
merge1=pd.merge(left=obj_final, right=obj5, on="ID", how="left")
Upvotes: 2
Views: 162
Reputation: 150735
Option 1: concat
and drop_duplicates
:
(pd.concat( (df1, df2, df3, df4), sort=False)
.groupby('ID')
.bfill()
.drop_duplicates('ID')
.reset_index(drop=True)
)
Output:
ID Name Age LAN_NBR Street City State TYPE CLOSE_DATE
0 1 ABC 24.0 G284992 NaN NaN NaN HUB NaN
1 2 John NaN NaN Wacker Chicago IL HUB NaN
2 3 NaN NaN NaN NaN NaN NaN HUB 1/1/2021
Option 2: use combine_first
as @ansev suggested:
ret_df = pd.DataFrame()
for d in (df1, df2, df3, df4):
ret_df = ret_df.combine_first(d.set_index('ID') )
ret_df = ret_df.reset_index()
Output:
ID Age CLOSE_DATE City LAN_NBR Name State Street TYPE
0 1 24.0 NaN NaN G284992 ABC NaN NaN HUB
1 2 NaN NaN Chicago NaN John IL Wacker HUB
2 3 NaN 1/1/2021 NaN NaN NaN NaN NaN HUB
Upvotes: 2
Reputation: 29635
you can use concat
and than groupby
with first
to keep the first non null value available if any per ID.
print (pd.concat([df1, df2, df3, df4]).groupby('ID').first())
Name Age LAN_NBR Street City State TYPE CLOSE_DATE
ID
1 ABC 24.0 G284992 NaN NaN NaN HUB NaN
2 John NaN NaN Wacker Chicago IL HUB NaN
3 NaN NaN NaN NaN NaN NaN HUB 1/1/2021
Upvotes: 2