Sean Allred
Sean Allred

Reputation: 3658

Algorithm optimization for finding co-occurences between two sorted lists

At the moment, my algorithm takes (estimated) well over ten hours to run to completion. It's still running right now, just so I can get better estimates on how simply awful it is.

Say I have a set of people P each with a sorted list of occurrences of varying length, where i is an indexing variable. I'd like to create a graph G such that GPi,Pj = n, where n is the edge weight between Pi and Pj which represents the number of times they co-occur within a certain static range r.

My current algorithm is mindless and is implemented in Python (to be both readable and unambiguous) as follows: (altered for brevity from its repository on GitHub)

print '>Generating combinations...',
pairs = combinations(people, 2)
print 'Done'

print 'Finding co-occurences'
radius = 5
for A, B in pairs:
    for oA in A.occurances:
        for oB in B.occurances:
            if oB in range(oA - radius, oA + radius):
                try:
                    network.edge[A.common_name][B.common_name]['weight'] += 1
                except:
                    network.add_edge(A.common_name, B.common_name, weight=1)

I've considered altering this algorithm so that when oB surpasses the range of the current oA, the loop simply continues on to the next oA.

Is there any better way of achieving this, considering that the list is sorted?

Upvotes: 2

Views: 110

Answers (2)

Neil Forrester
Neil Forrester

Reputation: 5251

Your idea of moving on to the next oA as soon as you pass the upper boundary is a good one. Also, if the ranges of A.occurances and B.occurances are large in comparison to 'radius', then it will be much more efficient to not start from the beginning of B.occurances every time:

print '>Generating combinations...',
pairs = combinations(people, 2)
print 'Done'

print 'Finding co-occurences'
radius = 5
for A, B in pairs:
    i = 0
    b = B.occurances
    maxi = len(B.occurances) - 1
    for oA in A.occurances:
        lo = oA - radius
        hi = oA + radius
        while (b[i] > lo) and (i > 0):     # while we're above the low end of the range
            i = i - 1                      #   go towards the low end of the range
        while (b[i] < lo) and (i < maxi):  # while we're below the low end of the range
            i = i + 1                      #   go towards the low end of the range
        if b[i] >= lo:
            while (b[i] <= hi):            # while we're below the high end of the range
                try:                       #   increase edge weight
                    network.edge[A.common_name][B.common_name]['weight'] += 1
                except:
                    network.add_edge(A.common_name, B.common_name, weight=1)

                if i < maxi:               #   and go towards the high end of the range
                    i = i + 1
                else:
                    break

Note that I haven't debugged this, so there are probably bugs in it, but hopefully you can get the general idea of what I'm trying to do. There are, of course, further optimizations you could make to the idea, but this ought to be far more efficient than the brute force method.

Upvotes: 1

Zim-Zam O&#39;Pootertoot
Zim-Zam O&#39;Pootertoot

Reputation: 18158

One option is to put B.occurances in an interval tree so that you can quickly query for all oB in range (oA - radius, oA + radius).

Another option is to index B.occurances in buckets, e.g. [0, 1), [1, 2), etc. Then you can quickly find all oB in range (oA - radius, oA + radius) by selecting the buckets with indices (oA - radius) through (oA + radius). The buckets are approximate, so you'll need to still iteratively verify all oB in the first and last selected buckets.

Upvotes: 0

Related Questions