Reputation: 115
I am trying to cycle through the values of a column in pandas and change all similar values so they are harmonised. I have first extracted the column as a list and want to loop through for each row, replace the similar value when found with the value which is similar, then place the list back into dataframe replacing the column. so for example a column like:
Cool
Awesome
cool
CoOl
Awesum
Awesome
Mathss
Math
Maths
Mathss
would become:
CoOl
Awesome
coOol
CoOl
Awesome
Awesome
Mathss
Mathss
Mathss
Mathss
The code is as follows:
def matchbrands():
conn = sqlite3.connect('/Users/XXX/db.sqlite3')
c = conn.cursor()
matchbrands_df = pd.read_sql_query("SELECT * from removeduplicates", conn)
brands = [x for x in matchbrands_df['brand']]
i=1
for x in brands:
if fuzz.token_sort_ratio(x, brands[i]) > 85:
x = brands[i]
else:
i += 1
n = matchbrands_df.columns[7]
matchbrands_df.drop(n, axis=1, inplace=True)
matchbrands_df[n] = brands
matchbrands_df.to_csv('/Users/XXX/matchedbrands.csv')
matchbrands_df.to_sql('removeduplicates', conn, if_exists="replace")
however this does not change the column at all. I am unsure why. Any help would be appreciated
Upvotes: 0
Views: 890
Reputation: 142631
Your code makes no sense.
First: using x =...
you can't change value on list brands
. You need brands[index] = ...
Second: it needs nested for
-loop to compare x
with all other words in brands
for index, word in enumerate(brands):
for other in brands[index+1:]:
#print(word, other, fuzz.token_sort_ratio(word, other))
if fuzz.token_sort_ratio(word, other) > 85:
brands[index] = other
Minimal working code
import pandas as pd
import fuzzywuzzy.fuzz as fuzz
data = {'brands':
'''Cool
Awesome
cool
CoOl
Awesum
Awesome
Mathss
Math
Maths
Mathss'''.split('\n')
} # rows
df = pd.DataFrame(data)
print('--- before ---')
print(df)
brands = df['brands'].to_list()
print('--- changes ---')
for index, word in enumerate(brands):
#for other_index, other_word in enumerate(brands):
for other_index, other_word in enumerate(brands[index+1:], index+1):
#if word != other_word:
result = fuzz.token_sort_ratio(word, other_word)
if result > 85:
print(f'OK | {result:3} | {index:2} {word:7} -> {other_index:2} {other_word}')
elif result > 50:
print(f' | {result:3} | {index:2} {word:7} -> {other_index:2} {other_word}')
if result > 85:
brands[index] = other_word
#break
#word = other_word
df['brands'] = brands
print('--- after ---')
print(df)
Result:
--- before ---
brands
0 Cool
1 Awesome
2 cool
3 CoOl
4 Awesum
5 Awesome
6 Mathss
7 Math
8 Maths
9 Mathss
--- changes ---
OK | 100 | 0 Cool -> 2 cool
OK | 100 | 0 Cool -> 3 CoOl
| 77 | 1 Awesome -> 4 Awesum
OK | 100 | 1 Awesome -> 5 Awesome
OK | 100 | 2 cool -> 3 CoOl
| 77 | 4 Awesum -> 5 Awesome
| 80 | 6 Mathss -> 7 Math
OK | 91 | 6 Mathss -> 8 Maths
OK | 100 | 6 Mathss -> 9 Mathss
OK | 89 | 7 Math -> 8 Maths
| 80 | 7 Math -> 9 Mathss
OK | 91 | 8 Maths -> 9 Mathss
--- after ---
brands
0 CoOl
1 Awesome
2 CoOl
3 CoOl
4 Awesum
5 Awesome
6 Mathss
7 Maths
8 Mathss
9 Mathss
It doesn't change Awesum
into Awesome
because it gets 77
It doesn't change Math
into Mathss
because it gets 80
. But it gets 89
for Maths
.
If you use word = other_word
in for
-loop then it can convert Math
to Maths
(89
) and next this Maths
to Mathss
(91
). But this way it may change many times and finally it change to word which originally can gives value much smaller then 85
. Expected result you can get also for 75
instead of 85
.
But this method get last word with value >85
, not with the bigest value - so there can be better matching word and it will not use it. Using break it get first word with >85
. maybe it should get all words with >85
and choose word with biggest value. And it would have to skip words which are the same but in different row. But all this can make strange situations.
In comments in code I keep other ideas for modifications.
EDIT:
The same with >75
and with colors.
import pandas as pd
import fuzzywuzzy.fuzz as fuzz
from colorama import Fore as FG, Back as BG, Style as ST
data = {'brands':
'''Cool
Awesome
cool
CoOl
Awesum
Awesome
Mathss
Math
Maths
Mathss'''.split('\n')
} # rows
df = pd.DataFrame(data)
print('--- before ---')
print(df)
brands = df['brands'].to_list()
print('--- changes ---')
for index, word in enumerate(brands):
print('-', index, '-')
#for other_index, other_word in enumerate(brands):
for other_index, other_word in enumerate(brands[index+1:], index+1):
#if word != other_word:
result = fuzz.token_sort_ratio(word, other_word)
if result > 85:
color = ST.BRIGHT + FG.GREEN
info = 'OK'
elif result > 75:
color = ST.BRIGHT + FG.YELLOW
info = ' ?'
elif result > 50:
color = ST.BRIGHT + FG.WHITE
info = ' '
else:
color = ST.BRIGHT + FG.RED
info = ' -'
print(f'{color}{info} | {result:3} | {index:2} {word:7} -> {other_index:2} {other_word}{ST.RESET_ALL}')
if result > 75:
brands[index] = other_word
#break
#word = other_word
df['brands'] = brands
print('--- after ---')
print(df)
Upvotes: 1