Reputation: 15
I'm having issues with my program being too slow. Here is what I need my code to do (actual code is below):
Imagine you have a very large (1 million+) list of "person" objects. Each person has attributes like "hairColor", "height", "weight", etc. Each also has an isMatched Boolean attribute.
For every person object in the list, the code tries to find another person whose attributes match exactly. When it finds two person objects that match, it puts them into a "pair" object. If there is no match, it pairs the person object with None. Here is the code:
import collections

Deck = collections.deque()
Person = [per1, per2, ..., per1000000]

while Person:
    personToBeMatched = Person.pop()
    if not personToBeMatched.isMatched:
        # next() needs parentheses around the generator when a default is supplied
        match = next((per for per in Person
                      if per.hairColor == personToBeMatched.hairColor
                      and per.height == personToBeMatched.height), None)
        if match is not None:
            match.isMatched = True  # was per.isMatched = True; per is not defined here
        Deck.append(Pair(match, personToBeMatched))  # match is None when nothing matched
I ultimately want to do this quickly for a list of 1 million+ entries, but even 100,000 is taking far too long. Does anyone know how I can speed this up?
Upvotes: 1
Views: 84
Reputation: 11581
Since you are only looking for exact matches, the solution is very simple.
For each person, build a tuple of attributes. Here are a few ways to do this:
person.tup = (person.hairColor, person.height, ...)
or
keys = "hairColor", "height", .....
tup = (person[k] for k in keys)
Now, since tuples are hashable, we can just put everything in a dict and use the tuple as a key:
d = {}
for person in persons:
    tup = gettuple(person)  # build the attribute tuple using one of the snippets above
    d.setdefault(tup, []).append(person)
...and in the end, dict d contains lists of all the matches. Note that some lists may contain more than 2 persons, possibly an odd number, so you'll have to choose how to pair them; that's up to you (one possible policy is sketched below). But this handles your matching very quickly, since dict lookups are very fast.
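For example, a minimal sketch of one pairing policy, assuming a Pair class like the one in your code and pairing any odd person out with None:

deck = []
for group in d.values():
    it = iter(group)
    for a, b in zip(it, it):       # pair group members off two at a time
        deck.append(Pair(a, b))
    if len(group) % 2:             # odd person out gets paired with None
        deck.append(Pair(group[-1], None))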
The reason this solution is much faster is that it is O(n): only one dict lookup is done per person, and dict lookups are O(1). Your solution compares every record with every other record and is thus O(n^2), which is something you really want to avoid.
Another solution is to sort based on the tuple of keys; the matches will then appear next to each other in the sorted result (see the sketch below).
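A minimal sketch of that sort-based variant using itertools.groupby, reusing the gettuple helper from above:

from itertools import groupby

persons.sort(key=gettuple)                    # O(n log n)
for tup, group in groupby(persons, key=gettuple):
    matches = list(group)                     # all persons sharing this attribute tuple
    ...                                       # pair them up as above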
EDIT:
Complexity of your solution (i.e., number of operations) with a list of size n:
It takes an element out of the list and compares it to all the others, so that's at most (n-1) comparisons. It stops at the first match it finds, so it will usually be fewer; we can add a fudge factor k (say k = 0.5) and say it finds a match within the first k*(n-1) elements on average.
Then it does the same on a list that is one element shorter (n-2 comparisons), and so on until the list is exhausted.
So you get k*( (n-1)+(n-2)+ ... +2+1 ) operations.
Since 1+2+...+(n-1) = n(n-1)/2, this is more or less (k/2) * n^2. For n = 100,000 that is on the order of 10^10 comparisons, versus about 10^5 dict operations for the hash-based approach.
The point is, the value of k (or the 1/2) doesn't really matter once you have an n^2 term. This is why we write O(n^2), meaning "n^2 multiplied by whatever constant we don't care about, since when n grows large, it's going to blow up anyway."
You really want O(n) for speed, like my solution (scan the list once, build a hash, scan the list again looking for matches).
If the list were truly huge (i.e., does not fit in RAM), then hash-based approaches tend to fail spectacularly, because the random access patterns into the hash map are not compatible with virtual memory stored on a slow drive. In that case you'd want something with disk-friendly access patterns, like a merge sort in O(n log n). You'd probably use a SQL database for this, as they come with that feature built in.
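For illustration, a minimal sketch of the database route using Python's built-in sqlite3; the table name, column names, and the rows iterable are all made up here, and assume the persons can be dumped to (id, hairColor, height) tuples:

import sqlite3

conn = sqlite3.connect("persons.db")     # on-disk database, so RAM is not the limit
conn.execute("CREATE TABLE person (id INTEGER, hairColor TEXT, height REAL)")
conn.executemany("INSERT INTO person VALUES (?, ?, ?)", rows)  # rows: iterable of tuples
conn.execute("CREATE INDEX idx_attrs ON person (hairColor, height)")

# GROUP BY streams the match groups in index order instead of random access:
for hair, height, ids in conn.execute(
        "SELECT hairColor, height, group_concat(id) FROM person "
        "GROUP BY hairColor, height"):
    ...  # pair up the ids within each group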
Upvotes: 4