Reputation: 25
I have a fairly large Excel file (5,000 rows) with some company information. Many companies have been added multiple times, but each of the duplicate entries is missing some of the columns. My goal is to merge all duplicate companies without losing any of the information from the other columns.
Example:
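A simplified, made-up illustration of the problem: the same company appears on several rows, and each row fills in different columns.

import numpy as np
import pandas as pd

duplicates = pd.DataFrame({'Company': ['Acme', 'Acme', 'Beta Ltd'],
                           'Phone': ['555-1234', np.nan, '555-9999'],
                           'City': [np.nan, 'Helsinki', np.nan]})

# Desired result: one row per company, keeping every non-missing value
#     Company     Phone      City
#     Acme        555-1234   Helsinki
#     Beta Ltd    555-9999   NaN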
My search results have only shown how to combine two different data frames and how to merge cells when there is no risk of overlap.
(https://www.datacamp.com/community/tutorials/joining-dataframes-pandas)
Probably the closest answer is this one.
It does, however, merge the different values into a single cell instead of splitting them into multiple cells.
Upvotes: 1
Views: 625
Reputation: 3331
You could use groupby() with agg to specify what you want to do with each column. Here is what it could look like:
import numpy as np
import pandas as pd

df = pd.DataFrame({'Company': ['Apple', 'Google', 'Apple', 'Microsoft'],
                   'Score': [7, np.nan, 6, 8],
                   'Employees': [8888, 55000, np.nan, np.nan],
                   'Description': ['Tasty', np.nan, "Likes trees", "Doesn't like apples"]})

# Keep the maximum of the numeric columns, collect all descriptions into a list
df.groupby('Company').agg({'Score': max, 'Employees': max, 'Description': list})
Output:
           Score  Employees            Description
Company
Apple        7.0     8888.0   [Tasty, Likes trees]
Google       NaN    55000.0                  [nan]
Microsoft    8.0        NaN  [Doesn't like apples]
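Since the goal is to split the merged values into multiple cells rather than keep them in one list cell, one possible follow-up sketch is to expand the list column into numbered columns afterwards (the Description_1, Description_2 names here are just made up):

result = df.groupby('Company').agg({'Score': max, 'Employees': max, 'Description': list})

# Expand each list of descriptions into its own set of columns;
# shorter lists are padded with NaN automatically.
desc = pd.DataFrame(result['Description'].tolist(), index=result.index)
desc.columns = ['Description_' + str(i + 1) for i in desc.columns]

result = result.drop(columns='Description').join(desc)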
Upvotes: 1