feration48

Reputation: 155

Matching strings within two lists

Here is the problem: I want to define a function that compares strings from two lists (of different sizes) using fuzz.ratio(). It should return the items from list 1 that have at least one ratio greater than 60 when compared with the items in list 2.

def Matching(list1, list2):
    no_matching = []
    matching = []
    for item1 in list1:    
        for item2 in list2:        
            m_score = fuzz.ratio(item1, item2)
        if m_score < 60:
            no_matching.append(item1)
        if m_score > 60:
            matching.append(item1)
    return(matching, no_matching)

The output is not what I am aiming for. Which part am I doing wrong? I want only the items from list 1 that have at least one match in list 2 with a ratio greater than 60.

For example:

list1 = ["Real Madrid", "Benfica", "Lazio", "FC Milan"]
list2 = ["Madrid", "Barcelona", "Milan"]

for item1 in list1:
    for item2 in list2:
        m_score = fuzz.ratio(item1, item2)
        print(item1, "&", item2, m_score)

Output is:

Real Madrid & Madrid 71 # greater than 60
Real Madrid & Barcelona 20
Real Madrid & Milan 12
Benfica & Madrid 15
Benfica & Barcelona 50
Benfica & Milan 17
Lazio & Madrid 36
Lazio & Barcelona 29
Lazio & Milan 20
FC Milan & Madrid 29
FC Milan & Barcelona 24
FC Milan & Milan 77 # greater than 60

The function output should be:

matching = ["Real Madrid", "FC Milan"] # since they have at least one ratio bigger than 60
no_matching = ["Benfica", "Lazio"]

Upvotes: 2

Views: 1600

Answers (3)

maxbachmann

Reputation: 3265

Especially given your comment about doing this on large lists (10.000 * 1.200), I would recommend the usage of RapidFuzz (I am the author). A solution using RapidFuzz could be achieved in the following way:

from rapidfuzz import process, fuzz
import numpy as np

list1 = ["Real Madrid", "Benfica", "Lazio", "FC Milan"]
list2 = ["Madrid", "Barcelona", "Milan"]

scores = process.cdist(
    list1, list2, scorer=fuzz.ratio,
    dtype=np.uint8, score_cutoff=60)
# scores is array([[71,  0,  0],
#                  [ 0,  0,  0],
#                  [ 0,  0,  0],
#                  [ 0,  0, 77]], dtype=uint8)

matches = np.any(scores, 1)
# matches is array([ True, False, False,  True])

This still processes the whole N*M matrix, but it is significantly faster than doing the same using fuzzywuzzy/thefuzz. When working with really large lists it is possible to enable multithreading in process.cdist by passing the named argument workers (e.g. workers=-1 to use all available cores). The results above could be converted to the lists you showed in the example if that is needed:

matching = [x for x, is_match in zip(list1, matches) if is_match]
# ['Real Madrid', 'FC Milan']
not_matching = [x for x, is_match in zip(list1, matches) if not is_match]
# ['Benfica', 'Lazio']

I benchmarked this solution on an i7-8550U using two large lists (10.000 * 1.200):

from timeit import timeit

print(timeit(
"""
scores = process.cdist(
    list1, list2, scorer=fuzz.ratio,
    dtype=np.uint8, score_cutoff=60)

matches = np.any(scores, 1)

matching = [x for x, is_match in zip(list1, matches) if is_match]
not_matching = [x for x, is_match in zip(list1, matches) if not is_match]
""",
setup="""
from rapidfuzz import process, fuzz
import numpy as np

list1 = ["Real Madrid", "Benfica", "Lazio", "FC Milan"] * 2500
list2 = ["Madrid", "Barcelona", "Milan"] * 400
""", number=1
))

which took 0.33 seconds. Using workers=-1 reduced the runtime to 0.08 seconds.

Upvotes: 7

Tal Folkman

Reputation: 2561

Edit: instead of running two for loops, you can iterate over all the combinations:

import itertools
new_list = list(itertools.product(list1, list2))

Output:

[('Real Madrid', 'Madrid'), ('Real Madrid', 'Barcelona'), ('Real Madrid', 'Milan'), ('Benfica', 'Madrid'), ('Benfica', 'Barcelona'), ('Benfica', 'Milan'), ('Lazio', 'Madrid'), ('Lazio', 'Barcelona'), ('Lazio', 'Milan'), ('FC Milan', 'Madrid'), ('FC Milan', 'Barcelona'), ('FC Milan', 'Milan')]
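
One way to use these pairs, sketched under assumptions: score each pair and remember every list1 item that scores above 60 at least once. The names score and split_by_match are hypothetical, and score uses the standard library's difflib as a stand-in so the sketch runs without extra packages; swapping in fuzz.ratio should give the scores shown in the question.

```python
import itertools
from difflib import SequenceMatcher


def score(a, b):
    # Stand-in scorer (assumption): difflib similarity scaled to 0-100.
    # Replace with fuzz.ratio(a, b) when fuzzywuzzy/thefuzz is installed.
    return round(100 * SequenceMatcher(None, a, b).ratio())


def split_by_match(list1, list2, threshold=60):
    # Collect every list1 item that beats the threshold for at least one pair.
    matched = {
        item1
        for item1, item2 in itertools.product(list1, list2)
        if score(item1, item2) > threshold
    }
    # Walk list1 again so both result lists keep the original order.
    matching = [item for item in list1 if item in matched]
    no_matching = [item for item in list1 if item not in matched]
    return matching, no_matching


list1 = ["Real Madrid", "Benfica", "Lazio", "FC Milan"]
list2 = ["Madrid", "Barcelona", "Milan"]
print(split_by_match(list1, list2))
```

Note that this still scores every pair; it only replaces the nested loops, not the amount of work.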

You have a problem with indentation:

from fuzzywuzzy import fuzz

def Matching(list1, list2):
    no_matching = []
    matching = []
    m_score = 0
    for item1 in list1:    
        for item2 in list2:        
            m_score = fuzz.ratio(item1, item2)
            if m_score > 60:
                matching.append(item1)
                break  # one match is enough; avoids duplicates in matching
        # after the inner loop: item1 never scored above 60
        if item1 not in matching:
            no_matching.append(item1)
    return matching, no_matching

Upvotes: 1

Kabilan Mohanraj

Reputation: 1906

There are duplicate combinations in list1 and list2 that created copies in the no_matching list. Check if the element is already in the matching list. If yes, don't add to the no_matching list. The below code gives the expected output.

from fuzzywuzzy import fuzz

def Matching(list1, list2):
    no_matching = []
    matching = []
    m_score = 0
    for item1 in list1:    
        for item2 in list2:        
            m_score = fuzz.ratio(item1, item2)
            if m_score > 60:
                matching.append(item1)
                break  # one match is enough; avoids duplicates in matching
        # after the inner loop: item1 never scored above 60
        if item1 not in matching:
            no_matching.append(item1)
    return matching, no_matching


list1 = ["Real Madrid", "Benfica", "Lazio", "FC Milan"]
list2 = ["Madrid", "Barcelona", "Milan"]
print(Matching(list1, list2))

Output:

(['Real Madrid', 'FC Milan'], ['Benfica', 'Lazio'])
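
The "at least one ratio above 60" rule can also be written with any(), which stops scanning list2 at the first hit. This is only a sketch: ratio and matching_any are hypothetical names, and ratio is a difflib-based stand-in for fuzz.ratio so the example has no third-party dependency.

```python
from difflib import SequenceMatcher


def ratio(a, b):
    # Stand-in for fuzz.ratio (assumption): difflib similarity on a 0-100 scale.
    return round(100 * SequenceMatcher(None, a, b).ratio())


def matching_any(list1, list2, threshold=60):
    matching, no_matching = [], []
    for item1 in list1:
        # any() short-circuits: it stops at the first item2 above the threshold.
        if any(ratio(item1, item2) > threshold for item2 in list2):
            matching.append(item1)
        else:
            no_matching.append(item1)
    return matching, no_matching


print(matching_any(["Real Madrid", "Benfica"], ["Madrid", "Barcelona"]))
```

Because each item from list1 is placed exactly once, no extra check against the matching list is needed.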

Upvotes: 1
