Reputation: 25
I have a fairly large Excel file (5,000 rows) with some company information. Many companies have been added multiple times, but each of the duplicate entries is missing some of the columns. My goal is to merge all duplicate companies without losing any of the information from the other columns.
Example:
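A simplified, made-up illustration of the problem: the same company appears on several rows, and each row fills in different columns.

import numpy as np
import pandas as pd

duplicates = pd.DataFrame({'Company': ['Acme', 'Acme', 'Beta Ltd'],
                           'Phone': ['555-1234', np.nan, '555-9999'],
                           'City': [np.nan, 'Helsinki', np.nan]})

# Desired result: one row per company, keeping every non-missing value
#     Company     Phone      City
#     Acme        555-1234   Helsinki
#     Beta Ltd    555-9999   NaN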
My search results have only shown how to combine two different data frames and how to merge cells when there is no risk of overlap.
(https://www.datacamp.com/community/tutorials/joining-dataframes-pandas)
Probably the closest answer is this one.
It does, however, merge the different values into a single cell instead of splitting them into multiple cells.
Upvotes: 1
Views: 625
Reputation: 3331
You could use groupby() with agg to specify what you want to do with each column. Here is what it could look like:
import numpy as np
import pandas as pd

df = pd.DataFrame({'Company': ['Apple', 'Google', 'Apple', 'Microsoft'],
                   'Score': [7, np.nan, 6, 8],
                   'Employees': [8888, 55000, np.nan, np.nan],
                   'Description': ['Tasty', np.nan, "Likes trees", "Doesn't like apples"]})

# Keep the maximum of the numeric columns, collect all descriptions into a list
df.groupby('Company').agg({'Score': max, 'Employees': max, 'Description': list})
Output:
           Score  Employees            Description
Company
Apple        7.0     8888.0   [Tasty, Likes trees]
Google       NaN    55000.0                  [nan]
Microsoft    8.0        NaN  [Doesn't like apples]
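Since the goal is to split the merged values into multiple cells rather than keep them in one list cell, one possible follow-up sketch is to expand the list column into numbered columns afterwards (the Description_1, Description_2 names here are just made up):

result = df.groupby('Company').agg({'Score': max, 'Employees': max, 'Description': list})

# Expand each list of descriptions into its own set of columns;
# shorter lists are padded with NaN automatically.
desc = pd.DataFrame(result['Description'].tolist(), index=result.index)
desc.columns = ['Description_' + str(i + 1) for i in desc.columns]

result = result.drop(columns='Description').join(desc)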
Upvotes: 1