Reputation: 3658
At the moment, my algorithm takes (estimated) well over ten hours to run to completion. It's still running right now, just so I can get better estimates on how simply awful it is.
Say I have a set of people P each with a sorted list of occurrences of varying length, where i is an indexing variable. I'd like to create a graph G such that GPi,Pj = n, where n is the edge weight between Pi and Pj which represents the number of times they co-occur within a certain static range r.
My current algorithm is mindless and is implemented in Python (to be both readable and unambiguous) as follows: (altered for brevity from its repository on GitHub)
print '>Generating combinations...',
pairs = combinations(people, 2)
print 'Done'
print 'Finding co-occurences'
radius = 5
for A, B in pairs:
for oA in A.occurances:
for oB in B.occurances:
if oB in range(oA - radius, oA + radius):
try:
network.edge[A.common_name][B.common_name]['weight'] += 1
except:
network.add_edge(A.common_name, B.common_name, weight=1)
I've considered altering this algorithm so that when oB
surpasses the range of the current oA
, the loop simply continues on to the next oA
.
Is there any better way of achieving this, considering that the list is sorted?
Upvotes: 2
Views: 110
Reputation: 5251
Your idea of moving on to the next oA
as soon as you pass the upper boundary is a good one. Also, if the ranges of A.occurances
and B.occurances
are large in comparison to 'radius', then it will be much more efficient to not start from the beginning of B.occurances
every time:
print '>Generating combinations...',
pairs = combinations(people, 2)
print 'Done'
print 'Finding co-occurences'
radius = 5
for A, B in pairs:
i = 0
b = B.occurances
maxi = len(B.occurances) - 1
for oA in A.occurances:
lo = oA - radius
hi = oA + radius
while (b[i] > lo) and (i > 0): # while we're above the low end of the range
i = i - 1 # go towards the low end of the range
while (b[i] < lo) and (i < maxi): # while we're below the low end of the range
i = i + 1 # go towards the low end of the range
if b[i] >= lo:
while (b[i] <= hi): # while we're below the high end of the range
try: # increase edge weight
network.edge[A.common_name][B.common_name]['weight'] += 1
except:
network.add_edge(A.common_name, B.common_name, weight=1)
if i < maxi: # and go towards the high end of the range
i = i + 1
else:
break
Note that I haven't debugged this, so there are probably bugs in it, but hopefully you can get the general idea of what I'm trying to do. There are, of course, further optimizations you could make to the idea, but this ought to be far more efficient than the brute force method.
Upvotes: 1
Reputation: 18158
One option is to put B.occurances in an interval tree so that you can quickly query for all oB in range (oA - radius, oA + radius).
Another option is to index B.occurances in buckets, e.g. [0, 1), [1, 2), etc. Then you can quickly find all oB in range (oA - radius, oA + radius) by selecting the buckets with indices (oA - radius) through (oA + radius). The buckets are approximate, so you'll need to still iteratively verify all oB in the first and last selected buckets.
Upvotes: 0