Reputation: 1695
I have a Pandas dataframe that collects the names of vendors at which a transaction was made. As this data is automatically collected from bank statements, lots of the vendors are similar... but not quite the same. In summary, I want to replace the different permutations of the vendors' names with a single name.
I think I can work out a way to do it (see below), but I'm a beginner and this seems to me like it's a complex problem. I'd be really interested to see how more experienced coders would approach it.
I have a dataframe like this (in real life, it's about 20 columns and a maximum of around 50 rows):
Groceries Car Luxuries
0 Sainsburys Texaco wst453 Amazon
1 Sainsburys bur Texaco east Firebox Ltd
2 Sainsbury's east Shell wstl Sony
3 Tesco Shell p/stn Sony ent nrk
4 Tescos ref 657 Texac Amazon EU
5 Tesco 45783 Moto Amazon marketplace
I'd like to find the similar entries and replace them with the first instance of those entries, so I'd end up with this:
Groceries Car Luxuries
0 Sainsburys Texaco wst453 Amazon
1 Sainsburys Texaco wst453 Firebox Ltd
2 Sainsburys Shell wstl Sony
3 Tesco Shell wstl Sony
4 Tesco Texaco wst453 Amazon
5 Tesco Moto Amazon
My solution might be far from optimum. I was thinking of sorting alphabetically, then going through bitwise and using something like SequenceMatcher from difflib to compare each pair of vendors. If the similarity is above a certain percentage (I'm expecting to play with this value until I'm happy) then the two vendors will be assumed to be the same. I'm concerned that I might be using a sledgehammer to crack a nut, or it might take a long time (I'm not obsessed with performance, but equally I don't want to wait hours for the result).
Really interested to hear people's thoughts on this problem!
Upvotes: 2
Views: 2262
Reputation: 4612
At the start, the problem doesn't seem complicated, but it is.
I used string similarity package named fuzzywuzzy to decide which string must be replaced. This package uses Levenshtein Similarity, and I used %90 as the threshold value. Also, the first word of any string is used as comparison string. Here is my code:
import pandas
from fuzzywuzzy import fuzz
# Replaces %90 and more similar strings
def func(input_list):
for count, item in enumerate(input_list):
rest_of_input_list = input_list[:count] + input_list[count + 1:]
new_list = []
for other_item in rest_of_input_list:
similarity = fuzz.ratio(item, other_item)
if similarity >= 90:
new_list.append(item)
else:
new_list.append(other_item)
input_list = new_list[:count] + [item] + new_list[count :]
return input_list
df = pandas.read_csv('input.txt') # Read data from csv
result = []
for column in list(df):
column_values = list(df[column])
first_words = [x[:x.index(" ")] if " " in x else x for x in column_values]
result.append(func(first_words))
new_df = pandas.DataFrame(result).transpose()
new_df.columns = list(df)
print(new_df)
Output:
Groceries Car Luxuries
0 Sainsbury's Texac Amazon
1 Sainsbury's Texac Firebox
2 Sainsbury's Shell Sony
3 Tesco Shell Sony
4 Tesco Texac Amazon
5 Tesco Moto Amazon
UPDATE:
More readable version of func
, which produces the same result:
def func(input_list):
for i in range(len(input_list)):
for j in range(len(input_list)):
if i < j and fuzz.ratio(input_list[i], input_list[j]) >= 90:
input_list[i] = input_list[j] # Keep the last encountered item
# Use following line to keep the first encountered item
# input_list[j] = input_list[i]
Upvotes: 5