Alex T

Reputation: 3754

Matching similar values in DataFrame

I have a DataFrame in the following form:

Name    Count
Car     500
Cars    300
Train   100
trainz  200
Planes  1000
Plane   100 
planses 1
Ship    100
ships   10

I'm trying to match similar values with each other so that the numbers in the Count column are summed for similar values.

The output DataFrame should therefore contain, for each group, the first value found in the Name column and the sum of the Count column over all similar values:

Name  Count
Car    800
Train  300
Planes 1101
Ship   110

Upvotes: 2

Views: 270

Answers (2)

Lore

Reputation: 1928

You can implement a custom function (perhaps using difflib, as in the other answer) that maps each value in Name to the first similar value seen, if one exists, and apply it to the Name column.

Finally, you can use groupby on Name with sum:

df.groupby('Name').agg('sum')
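
As a rough sketch of that idea, assuming difflib.get_close_matches with its default cutoff of 0.6 is an acceptable notion of "similar": the helper name canonical_names is made up for illustration, and sort=False is used so the groups stay in order of first appearance rather than being sorted alphabetically.

```python
import difflib

import pandas as pd

df = pd.DataFrame({
    "Name": ["Car", "Cars", "Train", "trainz", "Planes", "Plane", "planses", "Ship", "ships"],
    "Count": [500, 300, 100, 200, 1000, 100, 1, 100, 10],
})


def canonical_names(names, cutoff=0.6):
    """Hypothetical helper: map each name to the first earlier name it closely matches."""
    seen = []     # canonical names, in order of first appearance
    mapping = {}
    for name in names:
        # Compare only against canonical names found so far; keep the best match.
        match = difflib.get_close_matches(name, seen, n=1, cutoff=cutoff)
        if match:
            mapping[name] = match[0]
        else:
            seen.append(name)
            mapping[name] = name
    return mapping


mapping = canonical_names(df["Name"])
result = (
    df.assign(Name=df["Name"].map(mapping))            # replace each name by its canonical form
      .groupby("Name", sort=False, as_index=False)["Count"]
      .sum()
)
print(result)
```

With the sample data this should reproduce the table from the question, but the grouping depends on the cutoff, so it is worth checking on real data before trusting it.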

Alternatively, use apply to create another numeric column that holds the same number for similar terms, then group by that new column.
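
A minimal sketch of the numeric-column variant, again assuming difflib is the similarity measure. The column name group_id and the helper are hypothetical, and a plain list comprehension stands in for apply because the helper keeps state between calls.

```python
import difflib

import pandas as pd

df = pd.DataFrame({
    "Name": ["Car", "Cars", "Train", "trainz", "Planes", "Plane", "planses", "Ship", "ships"],
    "Count": [500, 300, 100, 200, 1000, 100, 1, 100, 10],
})

leaders = []  # first name seen for each group; its list index is the group number


def group_id(name, cutoff=0.6):
    """Hypothetical helper: return the number of the group this name belongs to."""
    match = difflib.get_close_matches(name, leaders, n=1, cutoff=cutoff)
    if match:
        return leaders.index(match[0])
    leaders.append(name)          # no similar leader yet, so start a new group
    return len(leaders) - 1


df["group_id"] = [group_id(name) for name in df["Name"]]

# Keep the first name of each group and sum the counts.
result = (
    df.groupby("group_id")
      .agg(Name=("Name", "first"), Count=("Count", "sum"))
      .reset_index(drop=True)
)
print(result)
```

Grouping on a number instead of the name leaves the original Name column untouched, which can help when you want to inspect which rows ended up together.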

Upvotes: 0

Ilu

Reputation: 56

Have a look at difflib.

The following code

import difflib
print(difflib.get_close_matches('Car', ['Car', 'Cars', 'Train', 'trainz', 'Planes', 'Plane', 'planses', 'Ship', 'ships']))
print(difflib.get_close_matches('Train', ['Car', 'Cars', 'Train', 'trainz', 'Planes', 'Plane', 'planses', 'Ship', 'ships']))
print(difflib.get_close_matches('Planes', ['Car', 'Cars', 'Train', 'trainz', 'Planes', 'Plane', 'planses', 'Ship', 'ships']))
print(difflib.get_close_matches('Ship', ['Car', 'Cars', 'Train', 'trainz', 'Planes', 'Plane', 'planses', 'Ship', 'ships']))

gives your desired groups:

['Car', 'Cars']
['Train', 'trainz']
['Planes', 'Plane', 'planses']
['Ship', 'ships']
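
To see why these groups come out, it may help to look at the scores directly: get_close_matches ranks candidates by SequenceMatcher.ratio() and keeps those scoring at least cutoff (0.6 by default), returning at most n of them (3 by default). The sketch below only prints the scores for the pairs above; the variable names are illustrative.

```python
import difflib

pairs = [
    ("Car", "Cars"),
    ("Train", "trainz"),
    ("Planes", "Plane"),
    ("Planes", "planses"),
    ("Ship", "ships"),
]

scores = {}
for word, candidate in pairs:
    # get_close_matches scores each candidate against the query word like this.
    scores[(word, candidate)] = difflib.SequenceMatcher(None, candidate, word).ratio()
    print(word, candidate, round(scores[(word, candidate)], 3))

# The comparison is case-sensitive, so lowercasing both sides can raise a score.
lowered = difflib.SequenceMatcher(None, "ships", "ship").ratio()
print("ship", "ships", round(lowered, 3))
```

'Ship' and 'ships' only just clear the default cutoff, so if your real data has more variation you may need to lower cutoff or normalise case first.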

Upvotes: 2
