Reputation: 185
I have a list of company names that are not written consistently. The data set looks like this:
df["Name"] = ["Google", "google", "Google.inc", "Google Inc.", "Google.com"]
I have about 500,000 rows, and the names should be standardized in the best way possible.
My code looks like this:
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
import pandas as pd

get_match = []
for row in df.index:
    name1 = df.get_value(row, "Name")
    for columns in df2.index:
        name2 = df2.get_value(columns, "Name")
        matched_token = [process.extract(x, name2, limit=3) for x in name1]
        get_match.append([matched_token, name1, name2])

df_maneet = pd.DataFrame({
    'Ratio': [i[0] for i in get_match],
    'name1': [i[1] for i in get_match],
    'name2': [i[2] for i in get_match],
})
My result in matched_token is
[[('google', 100, 0), ('Sxyzdgg.', 48, 9), ('ggigsk', 45, 2)]]
but I want to append the result to df and see a result like below.
I think I am doing something wrong in the matched_token line, but I can't figure out what.
Thanks in advance.
Upvotes: 0
Views: 1188
Reputation: 93
Maybe this code will help you:
import pandas as pd
from fuzzywuzzy import fuzz
from fuzzywuzzy import process

df = pd.DataFrame({"Name": ["Google", "google.inc"]})
df2 = pd.DataFrame({"Name": ["google", "google"]})

get_match = []
for row in df.index:
    # Wrap each name in a list so it is handled as one string,
    # not iterated character by character.
    # .at replaces get_value, which was removed in pandas 1.0.
    name1 = [df.at[row, "Name"]]
    for columns in df2.index:
        name2 = [df2.at[columns, "Name"]]
        # extract returns (choice, score) tuples; keep the best score only
        matched_token = [process.extract(x, name2, limit=3)[0][1] for x in name1]
        get_match.append([matched_token, name1[0], name2[0]])

df_maneet = pd.DataFrame({
    'name1': [i[1] for i in get_match],
    'name2': [i[2] for i in get_match],
    'Ratio': [i[0][0] for i in get_match],
})
Final dataframe:
        name1   name2  Ratio
0      Google  google    100
1      Google  google    100
2  google.inc  google     90
3  google.inc  google     90
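With 500,000 rows, comparing every pair of names gets slow, so it may help to normalize each name first and then look up one best match per row. The sketch below is only an illustration of that idea: it swaps in the standard-library difflib module instead of fuzzywuzzy so it runs without extra packages, and the CANONICAL list, the suffix pattern, and the helper names normalize and best_match are all assumptions, not part of the original code.

```python
import difflib
import re

# Hypothetical list of canonical names; in practice this might come from df2["Name"].
CANONICAL = ["google", "microsoft", "apple"]

# Trailing company suffixes to strip before comparing (an assumption; extend as needed).
SUFFIX_RE = re.compile(r"[\s.,]*\b(inc|com|corp|ltd|llc)\b\.?$")


def normalize(name):
    """Lowercase, trim, and drop a trailing suffix such as 'Inc.' or '.com'."""
    cleaned = name.strip().lower()
    return SUFFIX_RE.sub("", cleaned).strip(" .,")


def best_match(name, choices=CANONICAL, cutoff=0.6):
    """Return (choice, score from 0 to 100) for the closest choice, or (None, 0)."""
    cleaned = normalize(name)
    hits = difflib.get_close_matches(cleaned, choices, n=1, cutoff=cutoff)
    if not hits:
        return None, 0
    score = round(100 * difflib.SequenceMatcher(None, cleaned, hits[0]).ratio())
    return hits[0], score


names = ["Google", "google", "Google.inc", "Google Inc.", "Google.com"]
matches = [best_match(n) for n in names]
print(matches)
```

The resulting list has one (match, score) pair per input name, so it could be split into two new columns on df. The scores come from difflib and will not be identical to fuzzywuzzy ratios.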
Upvotes: 1