Reputation: 5274
def fancymatching(fname1, fname2):
#This function will do much smarter and fancy kinds of compares
if (fname1 == fname2):
return 1
else:
return 0
personlist = [
{
'pid':'1',
'fname':'john',
'mname':'a',
'lname':'smyth',
},{
'pid':'2',
'fname':'john',
'mnane':'a',
'lname':'smith',
},{
'pid':'3',
'fname':'bob',
'mname':'b',
'lname':'nope',
}
]
for person1 in personlist:
for person2 in personlist:
if person1['pid'] >= person2['pid']:
#don't check yourself, or ones that have been
continue
if fancymatching(person1['fname'], person2['fname']):
print (person1['pid'] + " matched " + person2['pid'])
I'm trying to improve on the idea of the above code. It works, but if personlist
becomes very large (say millions) I feel there must be something faster than 2 for loops.
What the code is doing is taking a list of dictionaries and running a fancy fuzzy matching function on the values of each dictionary against each other dictionary. So it's not as simple as just comparing all the dictionaries to the other ones. I'd like a way to run a function on each dictionary, maybe 2 for loops is the right way to do this? Any suggestions would be helpful!
Upvotes: 6
Views: 4570
Reputation: 107
Adding to MSeifert's answer, if your matching depends on fname1 == fname2 then you can sort and then group your list: ie:
from itertools import combinations, groupby
keyfunc = lambda x: x['fname']
data = sorted(personlist, key= keyfunc)
for key, group in groupby(data, key):
#every element in group will now match
for person1, person2 in combinations(group, 2):
print(person1['fname'], person2['fname'])
Obviously if you change your matching function you will need to change your key function so that it returns the same value for any elements that match and returns a different value for any elements that don't. This does rely on the fact that such a key function exists, which will not always be the case for an arbitrary matching function.
Upvotes: 0
Reputation: 152597
You can use itertools.combinations
which is essentially the same double loop but it iterates faster because it's written in C (that only reduces the constant factor, you still have the O(n**2)
runtime behaviour) and you don't need the if person1['pid'] >= person2['pid']: continue
anymore (that's built into the combinations
function already).
from itertools import combinations
for person1, person2 in combinations(personlist, 2):
print(person1['fname'], person2['fname'])
which prints:
('john', 'john')
('john', 'bob')
('john', 'bob')
However if your fancymatching
allows it then you could also group (O(n)
runtime) your values. For example in your case you only match identical 'fname'
-values.
>>> matches = {}
>>> for person in personlist:
... matches.setdefault(person['fname'], []).append(person)
>>> matches
{'bob': [{'fname': 'bob', 'lname': 'nope', 'mname': 'b', 'pid': '3'}],
'john': [{'fname': 'john', 'lname': 'smyth', 'mname': 'a', 'pid': '1'},
{'fname': 'john', 'lname': 'smith', 'mnane': 'a', 'pid': '2'}]}
But that's only possible if your fancymatching
allows such a grouping. Which is True for your case but if it's more complicated it might not be.
Upvotes: 7