Reputation: 3754
I have a DataFrame in the following form:
Name Count
Car 500
Cars 300
Train 100
trainz 200
Planes 1000
Plane 100
planses 1
Ship 100
ships 10
I'm trying to match similar values with each other so that the numbers in the Count column are summed for similar values.
Therefore, the output DataFrame would contain the first value found for each group in the Name column and the sum over all similar values in the Count column.
Name Count
Car 800
Train 300
Planes 1101
Ship 110
Upvotes: 2
Views: 270
Reputation: 1928
You can implement a custom function (for example, using difflib, as suggested in the other answer) that maps each value in Name to the first similar value (if one exists), and apply it to the Name column.
Finally, you can use groupby on Name with sum:
df.groupby('Name').agg('sum')
Alternative: with apply, create another numeric column that holds the same number for similar terms, then use groupby on the new column.
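The mapping-then-groupby idea above could be sketched roughly as follows. This is only one possible reading of it: the helper name canonical_names, the greedy "first seen wins" rule, and the cutoff of 0.6 (difflib's default) are assumptions made for illustration, not something the question or this answer specifies.

```python
import difflib

import pandas as pd

df = pd.DataFrame({
    "Name": ["Car", "Cars", "Train", "trainz", "Planes", "Plane", "planses", "Ship", "ships"],
    "Count": [500, 300, 100, 200, 1000, 100, 1, 100, 10],
})


def canonical_names(names, cutoff=0.6):
    """Hypothetical helper: map each name to the first earlier name it closely matches."""
    seen = []      # representatives, in order of first appearance
    mapping = {}
    for name in names:
        # Look only among representatives found so far; keep the single best match.
        match = difflib.get_close_matches(name, seen, n=1, cutoff=cutoff)
        if match:
            mapping[name] = match[0]
        else:
            seen.append(name)
            mapping[name] = name
    return mapping


mapping = canonical_names(df["Name"])

# Replace each name by its representative, then sum Count per representative.
# sort=False keeps the groups in order of first appearance.
result = (
    df.assign(Name=df["Name"].map(mapping))
      .groupby("Name", sort=False, as_index=False)["Count"]
      .sum()
)
print(result)
```

On the sample data this gives Car 800, Train 300, Planes 1101 and Ship 110. Note that a greedy pass like this depends on row order and on the chosen cutoff, so on real data the groups should be checked by hand.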
Upvotes: 0
Reputation: 56
Have a look at difflib.
The following code
import difflib
# Candidate names, taken from the Name column in the question
names = ['Car', 'Cars', 'Train', 'trainz', 'Planes', 'Plane', 'planses', 'Ship', 'ships']
# Print the close matches for the first value of each group
for name in ['Car', 'Train', 'Planes', 'Ship']:
    print(difflib.get_close_matches(name, names))
gives your desired groups
['Car', 'Cars']
['Train', 'trainz']
['Planes', 'Plane', 'planses']
['Ship', 'ships']
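One thing worth checking before relying on these groups is how close each match is to the cutoff. The sketch below is an illustration added on top of the answer, not part of it. It uses difflib.SequenceMatcher to show the score behind the 'Ship' / 'ships' match and what happens when the cutoff argument of get_close_matches is raised above its default of 0.6.

```python
import difflib

names = ['Car', 'Cars', 'Train', 'trainz', 'Planes', 'Plane', 'planses', 'Ship', 'ships']

# Similarity score that get_close_matches compares against its cutoff.
# The comparison is case-sensitive, so 'S' and 's' do not match here.
score = difflib.SequenceMatcher(None, 'ships', 'Ship').ratio()
print(round(score, 3))  # 0.667, only slightly above the default cutoff of 0.6

default_matches = difflib.get_close_matches('Ship', names)             # cutoff=0.6
strict_matches = difflib.get_close_matches('Ship', names, cutoff=0.7)  # stricter
print(default_matches)  # ['Ship', 'ships']
print(strict_matches)   # ['Ship']
```

So the groups shown above hold for the default cutoff, but a slightly stricter value already splits 'ships' off from 'Ship'. Lowercasing the names before matching is one way to make such pairs score higher.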
Upvotes: 2