Reputation: 845
I am doing some entity matching based on string edit distance and my results are a dictionary with keys (query string) and values [list of similar strings] based on some scoring criteria.
for example:
results = {
'ben' : ['benj', 'benjamin', 'benyamin'],
'benj': ['ben', 'beny', 'benjamin'],
'benjamin': ['benyamin'],
'benyamin': ['benjamin'],
'carl': ['karl'],
'karl': ['carl'],
}
Each value also has a corresponding dictionary item, for which it is the key (e.g. 'carl' and 'karl').
I need to combine the elements that have shared values. Choosing one value as the new key (lets say the longest string). In the above example I would hope to get:
results = {
'benjamin': ['ben', 'benj', 'benyamin', 'beny', 'benjamin', 'benyamin'],
'carl': ['carl','karl']
}
I have tried iterating through the dictionary using the keys, but I can't wrap my head around how to iterate and compare through each dictionary item and its list of values (or single value).
Upvotes: 3
Views: 211
Reputation: 164623
This is one solution using collections.defaultdict
and sets.
The desired output is very similar to what you have, and can be easily manipulated to align.
from collections import defaultdict
results = {
'ben' : ['benj', 'benjamin', 'benyamin'],
'benj': ['ben', 'beny', 'benjamin'],
'benjamin': 'benyamin',
'benyamin': 'benjamin',
'carl': 'karl',
'karl': 'carl',
}
d = defaultdict(set)
for i, (k, v) in enumerate(results.items()):
w = {k} | (set(v) if isinstance(v, list) else {v})
for m, n in d.items():
if not n.isdisjoint(w):
d[m].update(w)
break
else:
d[i] = w
result = {max(v, key=len): v for k, v in d.items()}
# {'benjamin': {'ben', 'benj', 'benjamin', 'beny', 'benyamin'},
# 'carl': {'carl', 'karl'}}
Credit to @IMCoins for the idea of manipulating v
to w
in second loop.
Explanation
There are 3 main steps:
Upvotes: 3
Reputation: 3306
EDIT : Even though performance was not the question here, I took the liberty to perform some tests between jpp's answer, and mine... here is the full script. My script performs the tests in 17.79 seconds, and his in 23.5 seconds.
import timeit
results = {
'ben' : ['benj', 'benjamin', 'benyamin'],
'benj': ['ben', 'beny', 'benjamin'],
'benjamin': ['benyamin'],
'benyamin': ['benjamin'],
'carl': ['karl'],
'karl': ['carl'],
}
def imcoins(result):
new_dict = {}
# .items() for python3x
for k, v in results.iteritems():
flag = False
# Checking if key exists...
if k not in new_dict.keys():
# But then, we also need to check its values.
for item in v:
if item in new_dict.keys():
# If we update, set the flag to True, so we don't create a new value.
new_dict[item].update(v)
flag = True
if flag == False:
new_dict[k] = set(v)
# Now, to sort our newly created dict...
sorted_dict = {}
for k, v in new_dict.iteritems():
max_string = max(v)
if len(max_string) > len(k):
sorted_dict[max(v, key=len)] = set(v)
else:
sorted_dict[k] = v
return sorted_dict
def jpp(result):
from collections import defaultdict
res = {i: {k} | (set(v) if isinstance(v, list) else {v}) \
for i, (k, v) in enumerate(results.items())}
d = defaultdict(set)
for i, (k, v) in enumerate(res.items()):
for m, n in d.items():
if n & v:
d[m].update(v)
break
else:
d[i] = v
result = {max(v, key=len): v for k, v in d.items()}
return result
iterations = 1000000
time1 = timeit.timeit(stmt='imcoins(results)', setup='from __main__ import imcoins, results', number=iterations)
time2 = timeit.timeit(stmt='jpp(results)', setup='from __main__ import jpp, results', number=iterations)
print time1 # Outputs : 17.7903265883
print time2 # Outputs : 23.5605850732
If I move the import from his function to global scope, it gives...
imcoins : 13.4129249463 seconds
jpp : 21.8191823393 seconds
Upvotes: 2