Reputation: 961
I've gone through and fuzzy matched each element in a list of 20,000+ movie titles with each other element, which returns a value for each pair:
from fuzzywuzzy import fuzz
titles = ['Scary Movie', 'Happy Movie', 'Sappy Movie', 'Crappy Movie']
print fuzz.ratio(titles[2],titles[3])
>>> 91 #/100 higher value denotes closer match
for x in titles:
for y in titles:
fuzzed = fuzz.ratio(x,y)
print "value for %r and %r is %r" % (x, y, fuzzed)
How can I organize this data efficiently? More specifically- how can I get matches to group together based on their match value?
Capturing the return values from the nested loops and then packaging them with x and y into tuples or lists is obviously redundant and messy. I attempted an implementation using classes but I'm missing something.
Upvotes: 0
Views: 397
Reputation: 1069
You only need to iterate over combinations
of the titles since the ratio doesn't depend on the order. This is significantly faster than iterating over a product
of it.
For your list of 20,000 titles, you would iterate over 400 000 000 pairs if you used product
. With combinations
, you are only iterating over 199 990 000.
from fuzzywuzzy import fuzz
import collections
import itertools
titles = ['Scary Movie', 'Happy Movie', 'Sappy Movie', 'Crappy Movie']
Then you can store the ratios in a dictionary where you can look up a ratio to get a set
of pairs with that ratio.
fuzzes_by_ratio = collections.defaultdict(set)
Or in a dictionary where you can look up a frozenset
of pairs and get their ratio.
fuzzes_per_pair = {}
-
for m1, m2 in itertools.combinations(titles, 2):
pair = frozenset({m1, m2})
ratio = fuzz.ratio(m1, m2)
fuzzes_by_ratio[ratio].add(pair)
fuzzes_per_pair[pair] = ratio
Then you can retrieve the data later:
# I don't have fuzzywuzzy installed so this is just made up:
>>> fuzzes_by_ratio[91]
{frozenset({"Scary Movie", "Happy Movie"}), frozenset({"Sappy Movie", "Happy Movie"})}
>>> fuzz_per_pair[frozenset({"Scary Movie", "Sappy Movie"})]
82
Keep in mind that you will need tons of memory for this storage. You can halve it by using only one of the two methods above, depending on your needs / convenience.
Upvotes: 1
Reputation: 330083
Using list comprehensions and itertools.product
:
from itertools import product
[(x, y, fuzz.ratio(x, y)) for (x, y) in product(titles, repeat=2)]
Nice and lazy solution using toolz
from toolz.curried import pipe, filter, map
pipe(
product(titles, repeat=2),
# No reason to check duplicates
filter(lambda (x, y): x > y),
map(lambda (x, y): (x, y, fuzz.ratio(x, y))))
Upvotes: 1
Reputation: 90889
Maybe you can store the fuzzed
ratio in a dictionary with the (x,y)
tuple as the key, would make it easier to search the ratio
for each pair later on. For that you can create an empty dictionary outside the for loop , and then in for loop assign the fuzz.ratio(x , y)
to the key (x, y)
for that dictionary.
Example code -
fuzzDict = {}
for x in titles:
for y in titles:
fuzzDict[(x,y)] = fuzz.ratio(x,y)
Later when you want to retrieve the ratio, you can simple call fuzzDict[(x , y)]
to get it.
You can also use dictionary comprehension for that in Python 2.7+ -
{(x, y) : fuzz.ratio(x,y) for x in titles for y in titles}
Upvotes: 0