Algorithm optimization for finding co-occurences between two sorted lists

Question

At the moment, my algorithm takes (estimated) well over ten hours to run to completion. It's still running right now, just so I can get better estimates on how simply awful it is.

Say I have a set of people P each with a sorted list of occurrences of varying length, where i is an indexing variable. I'd like to create a graph G such that G_{P_i,P_j} = n, where n is the edge weight between P_i and P_j which represents the number of times they co-occur within a certain static range r.

My current algorithm is mindless and is implemented in Python (to be both readable and unambiguous) as follows: (altered for brevity from its repository on GitHub)

print '>Generating combinations...',
pairs = combinations(people, 2)
print 'Done'

print 'Finding co-occurences'
radius = 5
for A, B in pairs:
    for oA in A.occurances:
        for oB in B.occurances:
            if oB in range(oA - radius, oA + radius):
                try:
                    network.edge[A.common_name][B.common_name]['weight'] += 1
                except:
                    network.add_edge(A.common_name, B.common_name, weight=1)

I've considered altering this algorithm so that when oB surpasses the range of the current oA, the loop simply continues on to the next oA.

Is there any better way of achieving this, considering that the list is sorted?

Neil Forrester · Accepted Answer

Your idea of moving on to the next oA as soon as you pass the upper boundary is a good one. Also, if the ranges of A.occurances and B.occurances are large in comparison to 'radius', then it will be much more efficient to not start from the beginning of B.occurances every time:

print '>Generating combinations...',
pairs = combinations(people, 2)
print 'Done'

print 'Finding co-occurences'
radius = 5
for A, B in pairs:
    i = 0
    b = B.occurances
    maxi = len(B.occurances) - 1
    for oA in A.occurances:
        lo = oA - radius
        hi = oA + radius
        while (b[i] > lo) and (i > 0):     # while we're above the low end of the range
            i = i - 1                      #   go towards the low end of the range
        while (b[i] < lo) and (i < maxi):  # while we're below the low end of the range
            i = i + 1                      #   go towards the low end of the range
        if b[i] >= lo:
            while (b[i] <= hi):            # while we're below the high end of the range
                try:                       #   increase edge weight
                    network.edge[A.common_name][B.common_name]['weight'] += 1
                except:
                    network.add_edge(A.common_name, B.common_name, weight=1)

                if i < maxi:               #   and go towards the high end of the range
                    i = i + 1
                else:
                    break

Note that I haven't debugged this, so there are probably bugs in it, but hopefully you can get the general idea of what I'm trying to do. There are, of course, further optimizations you could make to the idea, but this ought to be far more efficient than the brute force method.

Algorithm optimization for finding co-occurences between two sorted lists

Answers (2)

Related Questions