Reputation: 7335
I have DataFrame A (df_cam) with cli id and origin:
cli id | origin
------------------------------------
123 | 1234 M-MKT XYZklm 05/2016
And DataFrame B (df_dict) with shortcut and campaign:
shortcut | campaign
------------------------------------
M-MKT | Mobile Marketing Outbound
I know that, for example, the client with origin 1234 M-MKT XYZklm 05/2016 is actually from the campaign Mobile Marketing Outbound, because the origin contains the keyword M-MKT.
Note that a shortcut is a general keyword on which the algorithm should base its decision. The origin could just as well contain M-Marketing, MMKT, or Mob-MKT. I created the list of shortcuts manually by analyzing all the origins beforehand, and I use a regex to clean each origin before it is fed to the program.
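To illustrate, here is a minimal check of how fuzz.token_sort_ratio scores those variants against the M-MKT shortcut (the variant strings are the ones mentioned above; exact scores depend on the fuzzywuzzy version):

from fuzzywuzzy import fuzz

# Variants of the same campaign keyword, as listed above
for variant in ['M-MKT', 'M-Marketing', 'MMKT', 'Mob-MKT']:
    # token_sort_ratio tokenizes, sorts the tokens, and compares; 100 = identical
    print(variant, fuzz.token_sort_ratio('M-MKT', variant))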
I'd like to match each customer origin with a campaign through the shortcut and attach a similarity score to judge the match quality, as shown below:
cli id | shortcut | origin | campaign | Score
---------------------------------------------------------------------------------
123 | M-MKT | 1234 M-MKT XYZklm 05/2016 | Mobile Marketing Outbound | 0.93
Below is my program. It works, but it is really slow: DataFrame A has ~400,000 rows and DataFrame B has ~40 rows. Is there a way I can make it faster?
from fuzzywuzzy import fuzz

list_values = df_dict['Shortcut'].values.tolist()

def TopFuzzMatch(tokenA, dict_, position, value):
    """
    Calculates similarity between two tokens and returns the TOP match and score
    -----------------------------------------------------------------------
    tokenA: similarity to this token will be calculated
    dict_: list with shortcuts
    position: whether I want the first, second, third... TOP position
    value: 0=similarity score, 1=associated shortcut
    -----------------------------------------------------------------------
    """
    # Score every shortcut against the origin, then sort best-first
    sim = [(fuzz.token_sort_ratio(x, tokenA), x) for x in dict_]
    sim.sort(key=lambda tup: tup[0], reverse=True)
    return sim[position][value]

df_cam['1st_choice_short'] = df_cam.apply(lambda x: TopFuzzMatch(x['cli_origin'], list_values, 0, 1), axis=1)
df_cam['1st_choice_sim'] = df_cam.apply(lambda x: TopFuzzMatch(x['cli_origin'], list_values, 0, 0), axis=1)
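As a side note, the two apply calls above rescore the same list twice per row. A hypothetical variant (top_fuzz_pair is not in the original code) scores once per row, works on the single column instead of row-wise over the whole DataFrame, and fills both columns at once:

# Hypothetical helper: one scoring pass per row, returning (shortcut, score)
def top_fuzz_pair(token, choices):
    sim = sorted(((fuzz.token_sort_ratio(c, token), c) for c in choices), reverse=True)
    return sim[0][1], sim[0][0]

pairs = df_cam['cli_origin'].apply(lambda t: top_fuzz_pair(t, list_values))
df_cam['1st_choice_short'], df_cam['1st_choice_sim'] = zip(*pairs)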
Note that I'd also like to calculate the 2nd and 3rd best matches to evaluate the accuracy.
EDIT
I have found the process.extractOne method, but the speed remains the same. My code now looks like this:
from fuzzywuzzy import process

def TopFuzzMatch(token, dict_, value):
    # extractOne returns a (match, score) tuple for the best candidate
    score = process.extractOne(token, dict_, scorer=fuzz.token_sort_ratio)
    return score[value]
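For the 2nd and 3rd best matches mentioned above, process.extract with limit=3 returns the top-N (match, score) pairs in one pass, so the candidate list does not have to be rescored per position. A sketch using the same scorer (top_n_matches is a hypothetical helper):

from fuzzywuzzy import process, fuzz

def top_n_matches(token, choices, n=3):
    # Returns up to n (shortcut, score) tuples, best first
    return process.extract(token, choices, scorer=fuzz.token_sort_ratio, limit=n)

# e.g. top_n_matches('1234 M-MKT XYZklm 05/2016', list_values)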
Upvotes: 0
Views: 1109
Reputation: 7335
I found a solution: after I clean the origin column with a regex (no numbers or special characters), only a few hundred distinct values remain, so I run the fuzzy matching on those alone, which significantly improves the runtime.
import re
import numpy as np
import pandas as pd
from fuzzywuzzy import process

def TopFuzzMatch(df_cam, df_dict):
    """
    Calculates similarity between two tokens and returns the TOP match.
    The idea is to do it only over the distinct values in the given DF
    (it takes ages otherwise).
    -----------------------------------------------------------------------
    df_cam: DataFrame with client id and origin
    df_dict: DataFrame with the abbreviation that is matched to the description I need
    -----------------------------------------------------------------------
    """
    # Clean special characters and numbers
    df_cam['clean_camp'] = df_cam.apply(lambda x: re.sub('[^A-Za-z]+', '', x['origin']), axis=1)
    # Get unique values and calculate similarity only once per distinct value
    uq_origin = np.unique(df_cam['clean_camp'].values.ravel())
    top_match = [process.extractOne(x, df_dict['Shortcut'])[0] for x in uq_origin]
    # To DataFrame
    df_match = pd.DataFrame({'unique': uq_origin})
    df_match['top_match'] = top_match
    # Merge the matches back onto the full DataFrame
    df_cam = pd.merge(df_cam, df_match, how='left', left_on='clean_camp', right_on='unique')
    df_cam = pd.merge(df_cam, df_dict, how='left', left_on='top_match', right_on='Shortcut')
    return df_cam

df_out = TopFuzzMatch(df_cam, df_dict)
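The same unique-value trick extends to the score and to the 2nd/3rd matches asked for in the question. A sketch that would replace the top_match line inside the function (it assumes the uq_origin variable from above; passing a plain list of shortcuts makes each result a simple (match, score) pair):

from fuzzywuzzy import fuzz, process

# Top 3 (shortcut, score) pairs per distinct cleaned origin, computed once each
shortcuts = list(df_dict['Shortcut'])
top3 = [process.extract(x, shortcuts, scorer=fuzz.token_sort_ratio, limit=3)
        for x in uq_origin]
df_match = pd.DataFrame({'unique': uq_origin})
df_match['top_match'] = [t[0][0] for t in top3]  # best shortcut
df_match['top_score'] = [t[0][1] for t in top3]  # its score
df_match['2nd_match'] = [t[1][0] if len(t) > 1 else None for t in top3]
df_match['3rd_match'] = [t[2][0] if len(t) > 2 else None for t in top3]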
Upvotes: 1