sniperd
sniperd

Reputation: 5274

Python - something faster than 2 nested for loops

def fancymatching(fname1, fname2):
#This function will do much smarter and fancy kinds of compares
    if (fname1 == fname2):
        return 1
    else:
        return 0

personlist = [
{ 
'pid':'1',
'fname':'john',
'mname':'a',
'lname':'smyth',
},{ 
'pid':'2',
'fname':'john',
'mnane':'a',
'lname':'smith',
},{ 
'pid':'3',
'fname':'bob',
'mname':'b',
'lname':'nope',
}
]

for person1 in personlist:
    for person2 in personlist:
        if person1['pid'] >= person2['pid']:
            #don't check yourself, or ones that have been
        continue
        if fancymatching(person1['fname'], person2['fname']):
            print (person1['pid'] + " matched " + person2['pid'])

I'm trying to improve on the idea of the above code. It works, but if personlist becomes very large (say millions) I feel there must be something faster than 2 for loops.

What the code is doing is taking a list of dictionaries and running a fancy fuzzy matching function on the values of each dictionary against each other dictionary. So it's not as simple as just comparing all the dictionaries to the other ones. I'd like a way to run a function on each dictionary, maybe 2 for loops is the right way to do this? Any suggestions would be helpful!

Upvotes: 6

Views: 4570

Answers (2)

Fletcher Stump Smith
Fletcher Stump Smith

Reputation: 107

Adding to MSeifert's answer, if your matching depends on fname1 == fname2 then you can sort and then group your list: ie:

from itertools import combinations, groupby

keyfunc = lambda x: x['fname']
data = sorted(personlist, key= keyfunc)
for key, group in groupby(data, key):
    #every element in group will now match
    for person1, person2 in combinations(group, 2):
        print(person1['fname'], person2['fname'])

Obviously if you change your matching function you will need to change your key function so that it returns the same value for any elements that match and returns a different value for any elements that don't. This does rely on the fact that such a key function exists, which will not always be the case for an arbitrary matching function.

Upvotes: 0

MSeifert
MSeifert

Reputation: 152597

You can use itertools.combinations which is essentially the same double loop but it iterates faster because it's written in C (that only reduces the constant factor, you still have the O(n**2) runtime behaviour) and you don't need the if person1['pid'] >= person2['pid']: continue anymore (that's built into the combinations function already).

from itertools import combinations

for person1, person2 in combinations(personlist, 2):
    print(person1['fname'], person2['fname'])

which prints:

('john', 'john')
('john', 'bob')
('john', 'bob')

However if your fancymatching allows it then you could also group (O(n) runtime) your values. For example in your case you only match identical 'fname'-values.

>>> matches = {}
>>> for person in personlist:
...     matches.setdefault(person['fname'], []).append(person)
>>> matches
{'bob': [{'fname': 'bob', 'lname': 'nope', 'mname': 'b', 'pid': '3'}],
 'john': [{'fname': 'john', 'lname': 'smyth', 'mname': 'a', 'pid': '1'}, 
          {'fname': 'john', 'lname': 'smith', 'mnane': 'a', 'pid': '2'}]}

But that's only possible if your fancymatching allows such a grouping. Which is True for your case but if it's more complicated it might not be.

Upvotes: 7

Related Questions