Anssi
Anssi

Reputation: 25

Merging/combining duplicates in single column without losing data in other columns

I have a fairly large excel file (5000 rows) with some company information. Many companies have been added multiple times, but all of them are missing some columns. My goal is to merge all duplicate companies without losing any of the information from other columns.

Example:

enter image description here

My search results have only showed how to combine two different data frames and how to merge cells if there is no risk of overlap.

(https://www.datacamp.com/community/tutorials/joining-dataframes-pandas)

Probably the closest answer is in this one.

It does, however, merge the different values into single cell instead of splitting them into multiple different cells.

Upvotes: 1

Views: 625

Answers (1)

vlemaistre
vlemaistre

Reputation: 3331

You could use a groupby() with the agg option to specify what you want to do with each column. Here is what it could look like :

df = pd.DataFrame({'Company' : ['Apple', 'Google', 'Apple', 'Microsoft'],
                   'Score' : [7, np.nan, 6, 8], 
                   'Employees' : [8888, 55000, np.nan, np.nan],
                   'Description': ['Tasty', np.nan, "Likes trees", "Doesn't like apples"]})

df.groupby('Company').agg({'Score':  max, 'Employees': max, 'Description':list})

Output :

           Score  Employees            Description
Company                                           
Apple        7.0     8888.0   [Tasty, Likes trees]
Google       NaN    55000.0                  [nan]
Microsoft    8.0        NaN  [Doesn't like apples]

Upvotes: 1

Related Questions