Reputation: 8087
This question is linked to this: First Python list index greater than x?
I have a (sorted) list of floats, and for each value in a second list I want to find the first index in the sorted list whose value exceeds it,
e.g.
l=[0.2,0.3,0.7,0.9]
m=[0.25,0.6]
if m were a float I would use this:
bisect.bisect_left(l,m)
But for the case where m is a list this fails, and I can only think to employ a list comprehension:
[bisect.bisect_left(l,i) for i in m]
which gives:
[1, 2]
which works, but for the large lists in my real example I want to speed it up by avoiding the list comprehension, since my tests showed it was the bottleneck operation (I earlier stated I only suspected it was too slow). Is there a way to do this efficiently with a vectorized function, e.g. in numpy, or with an improved algorithm (since only one traversal of the list should be required)?
Upvotes: 1
Views: 739
Reputation: 8087
So I found there is a numpy function for this task, np.searchsorted, which is much faster than the list comprehension.
result=np.searchsorted(searchlist,newindices)
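For example, applied to the lists from the question (numpy's default side='left' matches bisect_left):

import numpy as np

l = [0.2, 0.3, 0.7, 0.9]
m = [0.25, 0.6]

# searchsorted returns, for each value in m, the leftmost insertion index in l
print(np.searchsorted(l, m))  # [1 2]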
These are the timings for the various solutions:
1. Standard List comprehension:
This was my first attempt at a solution.
python3 -m timeit -s "import numpy as np" -s "import bisect" -s "h=np.sort(np.random.uniform(size=10000))" -s "n=np.sort(np.random.uniform(size=1000))" "r=[bisect.bisect_left(h,i) for i in n]"
200 loops, best of 5: 1.61 msec per loop
2. Shortened search in for loop
This was the solution kindly provided by @lenik
python3 -m timeit -s "import numpy as np" -s "import bisect" -s "h=np.sort(np.random.uniform(size=10000))" -s "n=np.sort(np.random.uniform(size=1000))" "r=[bisect.bisect_left(h,n[0])]" "for i in n[1:]:" " r.append(bisect.bisect_left(h,i,r[-1]))"
200 loops, best of 5: 1.6 msec per loop
Hardly different from the list comprehension, which surprised me somewhat...
3. Numpy searchsorted
python3 -m timeit -s "import numpy as np" -s "import bisect" -s "h=np.sort(np.random.uniform(size=10000))" -s "n=np.sort(np.random.uniform(size=1000))" "r=np.searchsorted(h,n)"
10000 loops, best of 5: 33.6 usec per loop
Approximately 50 times faster than the list comprehension based solutions for this example, so hands down the fastest.
Upvotes: 1
Reputation: 881463
Well, there's a good chance that bisect_left is an O(log N) operation (binary search), so your overall operation would be O(K log N), where N relates to the size of l and K to the size of m.
Were the second list m sorted as well, you could make this an O(N + K) operation simply by running an index through both lists concurrently.
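A minimal sketch of that concurrent-index idea (the function name and variable names here are illustrative, not from the question):

import bisect

def first_indices(haystack, needles):
    # Both haystack and needles must be sorted ascending.
    # For each needle, find the leftmost index at which it could be inserted
    # into haystack (same result as bisect.bisect_left), advancing a single
    # index through haystack rather than restarting a search each time.
    result = []
    i = 0
    for needle in needles:
        while i < len(haystack) and haystack[i] < needle:
            i += 1
        result.append(i)
    return result

print(first_indices([0.2, 0.3, 0.7, 0.9], [0.25, 0.6]))  # [1, 2]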
But, re your "I suspect this is slow" comment, your first move should always be to test the easiest solution with the largest expected data set. If that's workable, stop right there! Only if it's deficient do you start thinking about optimisation.
For example, consider the following program:
import random
import bisect
haystack = []
for _ in range(1000000):
    haystack.append(random.random())
haystack.sort()
needles = []
for _ in range(10000):
    needles.append(random.random())
result = [bisect.bisect_left(haystack, needle) for needle in needles]
print(result)
This creates a 1,000,000-element haystack and a 10,000-element needle list, then uses your bisect-ing list comprehension to do the work. Running that on my (not particularly grunty) desktop with time shows:
real 0m0.738s # < 3/4 of a second elapsed
user 0m0.578s
sys 0m0.109s
And this includes the time taken to construct the lists, sort the big one, and print out the results.
Using timeit to get rid of all that setup time can be done with:
import timeit
import random
import bisect
haystack = []
for _ in range(1000000):
    haystack.append(random.random())
haystack.sort()
needles = []
for _ in range(10000):
    needles.append(random.random())
print(timeit.timeit('[bisect.bisect_left(haystack, needle) for needle in needles]', setup = 'from __main__ import bisect, haystack, needles', number = 1000))
The output of that is 12.27 seconds for the thousand iterations, which means you could do it about 80 times per second without breaking a sweat.
Upvotes: 4
Reputation: 23528
You have to remember the last index found and use it as the starting point for the next binary search, so instead of the list comprehension you have to use a for loop (this relies on m being sorted, as it is in your example):
result = [bisect.bisect_left(l, m[0])]
for i in m[1:]:
    result.append(bisect.bisect_left(l, i, result[-1]))
This should work faster than a simple comprehension.
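For example, with the lists from the question (a quick check, not part of the original answer):

import bisect

l = [0.2, 0.3, 0.7, 0.9]
m = [0.25, 0.6]

result = [bisect.bisect_left(l, m[0])]
for i in m[1:]:
    # restrict each search to start at the previous result
    result.append(bisect.bisect_left(l, i, result[-1]))

print(result)  # [1, 2]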
Upvotes: 1