Slow fuzzy matching between two DataFrames

Question

I have DataFrame A (df_cam) with cli id and origin:

cli id |            origin
------------------------------------
123    | 1234 M-MKT XYZklm 05/2016

And DataFrame B (df_dict) with shortcut and campaign

shortcut |         campaign
------------------------------------
M-MKT    | Mobile Marketing Outbound

I know that for example client with origin 1234 M-MKT XYZklm 05/2016 is actually from campaign Mobile Marketing Outbound because it contains key word M-MKT.

Note that shortcut is a general key word, based on what the algorithm should decide. The origin can as well be M-Marketing, MMKT or Mob-MKT. I created the list of shortcuts manually by analyzing all the origins in the first place. I am also using regex to clean the origin before it is fetched to the program.

I'd like to match customer origin with campaign through the shortcut and attach score to see the difference. As showed below:

cli id | shortcut |         origin            |        campaign           | Score
---------------------------------------------------------------------------------
123    | M-MKT    | 1234 M-MKT XYZklm 05/2016 | Mobile Marketing Outbound | 0.93

Below is my program which works, but is really slow. DataFrame A has ~400.000 rows and another DataFrame B has ~40 rows.

Is there a way I can make it faster?

from fuzzywuzzy import fuzz
list_values = df_dict['Shortcut'].values.tolist()

def TopFuzzMatch(tokenA, dict_, position, value):
    """
    Calculates similarity between two tokens and returns TOP match and score
    -----------------------------------------------------------------------
    tokenA: similarity to this token will be calculated
    dict_a: list with shortcuts
    position: whether I want first, second, third...TOP position
    value: 0=similarity score, 1=associated shortcut
    -----------------------------------------------------------------------
    """
    sim = [(fuzz.token_sort_ratio(x, tokenA),x) for x in dict_]
    sim.sort(key=lambda tup: tup[0], reverse=True)
    return sim[position][value]

df_cam['1st_choice_short'] = df_cam.apply(lambda x: TopFuzzMatch(x['cli_origin'],list_values,0,1), axis=1 )
df_cam['1st_choice_sim'] = df_cam.apply(lambda x: TopFuzzMatch(x['cli_origin'],list_values,0,0), axis=1 )

Note that I'd like to calculate also 2nd and 3rd best match to evaluate the accuracy.

EDIT

I have found process.ExtractOne method, but the speed remains same. So my code looks like this now:

def TopFuzzMatch(token, dict_, value):
    score = process.extractOne(token, dict_, scorer=fuzz.token_sort_ratio)
    return score[value]

HonzaB · Accepted Answer

I found a solution - after i clean the origin column with regex (no numbers and special characters), there are just a few hundred repeating distinct values, so I calculate the Fuzz algorithm just on those, which significantly improves the time.

def TopFuzzMatch(df_cam, df_dict):
    """
    Calculates similarity bewteen two tokens and return TOP match
    The idea is to do it only over distinct values in given DF (takes ages otherwise)
    -----------------------------------------------------------------------
    df_cam: DataFrame with client id and origin
    df_dict: DataFrame with abbreviation which is matched with the description i need
    -----------------------------------------------------------------------
    """
    #Clean special characters and numbers
    df_cam['clean_camp'] = df_cam.apply(lambda x: re.sub('[^A-Za-z]+', '',x['origin']), axis=1)

    #Get unique values and calculate similarity
    uq_origin = np.unique(df_cam['clean_camp'].values.ravel())
    top_match = [process.extractOne(x, df_dict['Shortcut'])[0] for x in uq_origin]

    #To DataFrame
    df_match = pd.DataFrame({'unique': uq_origin})
    df_match['top_match'] = top_match

    #Merge
    df_cam = pd.merge(df_cam, df_match, how = 'left', left_on = 'clean_camp', right_on = 'unique')
    df_cam = pd.merge(df_cam, df_dict, how = 'left', left_on = 'top_match', right_on = 'Shortcut')

    return df_cam

df_out = TopFuzzMatch(df_cam, df_dict)

Slow fuzzy matching between two DataFrames

Answers (1)

Related Questions