Reputation: 367
I am trying to change the value of the column 'Nombre' in a dataframe for specific rows. The goal is to reduce len(names)
by labeling similar names with the same name, using fuzzywuzzy. I tried to do it with a nested loop:
for name in names:
    for index in df_faces['Nombre'].index:
        name2 = df_faces.loc[index, 'Nombre']
        try:
            if fuzz.ratio(name, name2) > 90:
                df_faces.loc[index, 'Nombre'] = name
        except:
            pass
Here names is a list and df_faces
is a dataframe (an n×m table). This is taking very long, because the dataframe has about 1.2 million entries and names has about 1,000 elements.
edit: What happens when I drop the exceptions? I guess some of the names are of type float; I get an error whose traceback points into fuzzywuzzy. Should I convert the type of the data in order to drop the try/except?
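Something like this is what I have in mind, in case it clarifies the question (just a guess on my part, assuming the floats are NaN values in the 'Nombre' column):

# cast the whole column to string so fuzz.ratio never receives a float
df_faces['Nombre'] = df_faces['Nombre'].fillna('').astype(str)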
edit2: When I use the check_name(x) function I get the error below; I can't figure out what's wrong:
---------------------------------------------------------------------------
StopIteration Traceback (most recent call last)
<ipython-input-14-6ce51e3e0802> in <module>
3 return next(gen) if any(gen) else x
4
----> 5 df_faces.Nombre = df_faces.Nombre.apply(lambda x: check_name(x))
~/anaconda3/envs/tf-gpu/lib/python3.6/site-packages/pandas/core/series.py in apply(self, func, convert_dtype, args, **kwds)
4040 else:
4041 values = self.astype(object).values
-> 4042 mapped = lib.map_infer(values, f, convert=convert_dtype)
4043
4044 if len(mapped) and isinstance(mapped[0], Series):
pandas/_libs/lib.pyx in pandas._libs.lib.map_infer()
<ipython-input-14-6ce51e3e0802> in <lambda>(x)
3 return next(gen) if any(gen) else x
4
----> 5 df_faces.Nombre = df_faces.Nombre.apply(lambda x: check_name(x))
<ipython-input-14-6ce51e3e0802> in check_name(x)
1 def check_name(x):
2 gen = (name for name in names if fuzz.ratio(name, x) > 90)
----> 3 return next(gen) if any(gen) else x
4
5 df_faces.Nombre = df_faces.Nombre.apply(lambda x: check_name(x))
StopIteration:
Upvotes: 2
Views: 101
Reputation: 4487
Try this code:
def check_name(x):
    return next((name for name in names if fuzz.ratio(name, x) > 90), x)

df_faces.Nombre = df_faces.Nombre.apply(check_name)
The check_name() function takes advantage of generators: next() consumes the generator only until the first name satisfying fuzz.ratio(name, x) > 90 and returns it, and because x is passed as the default argument of next(), it is returned when there is no match. This is also why your edit2 version raises StopIteration: any(gen) already consumes the generator up to and including the first match, so the following next(gen) has nothing left to yield; giving next() a default avoids that.
Through the Series.apply function, we vectorize the calculation over the whole column and obtain the desired result efficiently.
I ran tests on dataframes of a few tens of thousands of rows with a few hundred elements in the list of names, and the solution I posted was about 6 times faster than the code in your question.
The bottleneck of your algorithm is certainly the lack of vectorization: explicit iteration and element-wise pandas assignments such as df.loc[index, ...] = ... are very slow, which is why it is good practice to vectorize your code whenever possible.
Vectorization is the process of executing operations on entire arrays. The whole point of vectorized calculations is to avoid Python-level loops by moving the work into highly optimized C code that operates on contiguous memory blocks.
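As a toy illustration of the contrast (a NumPy sketch unrelated to the fuzzy matching itself, with made-up sizes):

import numpy as np

a = np.arange(1_000_000)

# Python-level loop: the interpreter executes one iteration per element
doubled_slow = [x * 2 for x in a]

# vectorized: one call, the loop runs in optimized C over a contiguous array
doubled_fast = a * 2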
Other suggestion: you could keep a map (a dictionary, or whatever structure you find most appropriate) of the most common names, {key: name_from_df, value: name_from_list}. This way you can look a name up in the map before computing the fuzzy ratio. If the name is already in the map, the lookup costs O(1) on average for a dict (or O(log m) for a tree-based map of size m), which is far cheaper than scanning the whole list of names. It's up to you to choose an appropriate m for your problem.
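A minimal sketch of that caching idea (check_name_cached and name_cache are names I made up, and this cache memoizes by the exact string rather than only the most common names):

name_cache = {}

def check_name_cached(x):
    # return the cached match if this exact string was already resolved
    if x in name_cache:
        return name_cache[x]
    # otherwise fall back to the fuzzy scan and remember the result
    match = next((name for name in names if fuzz.ratio(name, x) > 90), x)
    name_cache[x] = match
    return match

df_faces.Nombre = df_faces.Nombre.apply(check_name_cached)

Since the dataframe has about 1.2 million rows but presumably far fewer distinct names, memoizing by the exact string should skip most of the fuzzy comparisons.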
Upvotes: 1