Reputation: 6217
Suppose I have an array of times. I know a priori that the maximum time is 1, say, so the array may look like
events = [0.1, 0.2, 0.7, 0.93, 1.37]
The numbers in that array represent the times at which events occurred in the interval [0,1]
(and I am ignoring anything larger than 1).
I don't know a priori the size of the array, but I do have reasonable upper bounds on its size (if that matters), so I can safely truncate it if needed.
I need to convert that array into an array which counts the number of events up to time x, where x ranges over a set of evenly spaced points in the time interval (a linspace). So, for example, if the granularity (= size) of that output array is 7, the result of my function should look like:
def count_events(events, granularity):
...
>>> count_events([0.1, 0.2, 0.7, 0.93, 1.37], 7)
array([0, 1, 2, 2, 2, 3, 4])
# since it checks at times 0, 1/6, 1/3, 1/2, 2/3, 5/6, 1.
I am looking for an efficient solution. Writing a loop is probably very easy here, but my event arrays may be huge. In fact, they are not 1D but 2D, and this counting operation should be applied per axis (like many other numpy functions). To be more precise, here is a 2D example:
def count_events(events, granularity, axis=None):
...
>>> events = np.array([[0.1, 0.2, 0.7, 0.93, 1.37], [0.01, 0.01, 0.9, 2.5, 3.3]])
>>> count_events(events, 7, axis=1)
array([[0, 1, 2, 2, 2, 3, 4],
[0, 2, 2, 2, 2, 2, 3]])
Upvotes: 2
Views: 304
Reputation: 221514
You can simply use np.searchsorted:
np.searchsorted(events, d) # with events being a sorted 1D array
where d is the linspaced array, created like so:
d = np.linspace(0, 1, 7) # 7 being the granularity (number of sample points)
Sample run for the 2D case:
In [548]: events
Out[548]:
array([[ 0.1 , 0.2 , 0.7 , 0.93, 1.37],
[ 0.01, 0.01, 0.9 , 2.5 , 3.3 ]])
In [549]: np.searchsorted(events[0], d) # Use per row
Out[549]: array([0, 1, 2, 2, 2, 3, 4])
In [550]: np.searchsorted(events[1], d)
Out[550]: array([0, 2, 2, 2, 2, 2, 3])
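For reference, these per-row calls can be wrapped into the count_events signature from the question. This is only a sketch: it assumes each row of events is already sorted in ascending order, and the 2D case loops over rows in plain Python.

```python
import numpy as np

def count_events(events, granularity):
    # Count events up to each of `granularity` evenly spaced times in [0, 1].
    # Assumes each row of `events` is sorted in ascending order.
    d = np.linspace(0, 1, granularity)
    events = np.asarray(events)
    if events.ndim == 1:
        return np.searchsorted(events, d)
    # 2D case: one searchsorted call per row (plain-Python loop).
    return np.vstack([np.searchsorted(row, d) for row in events])
```

This reproduces the expected outputs from the question for both the 1D and 2D examples.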
Using a vectorized version of searchsorted, searchsorted2d, we can even vectorize the whole thing and use it on all rows in one go, like so:
In [552]: searchsorted2d(events,d)
Out[552]:
array([[0, 1, 2, 2, 2, 3, 4],
[0, 2, 2, 2, 2, 2, 3]])
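Note that searchsorted2d is not a NumPy function. One possible implementation of the offset trick it relies on is sketched below, assuming all values are non-negative and every row of a is sorted in ascending order:

```python
import numpy as np

def searchsorted2d(a, b):
    # a: (m, n) array with each row sorted ascending, values >= 0.
    # b: (k,) array of query values.
    m, n = a.shape
    # Shift each row by a distinct offset so rows cannot interleave,
    # then run a single 1D searchsorted over the flattened array.
    max_num = max(a.max(), b.max()) + 1
    r = max_num * np.arange(m)[:, None]
    idx = np.searchsorted((a + r).ravel(), (b + r).ravel())
    # Undo the flattening: subtract each row's starting position.
    return idx.reshape(m, -1) - n * np.arange(m)[:, None]
```

On the 2D events array from the question, with d = np.linspace(0, 1, 7), this reproduces the per-row results above in a single call.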
Upvotes: 1
Reputation: 3094
Given that your rows are sorted, one idea for doing better than a linear scan is to run a binary search for each of your evenly spaced values. Each search retrieves the rightmost index such that all values before it are less than or equal to the searched value, which is exactly the count of events up to that time. This can be done very efficiently with Python's bisect_right function from the built-in bisect module.
bisect(a, x)
returns an insertion point which comes after (to the right of) any existing entries of x in a
Example code could go like:
import numpy as np
from bisect import bisect_right

# define your_array somehow
N = 10 # the number of sample points
lin_vals = np.linspace(0., 1., N)
counts = []
for i in range(your_array.shape[0]):
    row = your_array[i]
    # bisect_right gives the count of values in row that are <= v
    tmp = [bisect_right(row, v) for v in lin_vals]
    counts.append(tmp)
I haven't tested this code, but it should give you the general idea.
Doing so you'll have a complexity of roughly R*T*log(N), where R is the number of rows, T the number of sample points, and N the length of a row.
If this is still not fast enough, consider cropping your array rows to remove values greater than 1.
Next, since the linspaced values are increasing, you could gain speed by limiting each search to row[prev_idx:], which shrinks the range of every subsequent binary search.
You could also try re-implementing bisect_right so that it additionally returns the uppermost index it visited whose value is strictly greater than the next linspaced value. This way you can restrict row on both sides and be even faster!
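A sketch of the one-sided restriction, using bisect_right's lo parameter (the function name count_events_bisect is hypothetical; bisect_right still returns absolute indices, so no extra bookkeeping is needed):

```python
import numpy as np
from bisect import bisect_right

def count_events_bisect(events_2d, granularity):
    # Per row, count how many events occur up to each of `granularity`
    # evenly spaced times in [0, 1]; rows must be sorted ascending.
    lin_vals = np.linspace(0.0, 1.0, granularity)
    counts = np.empty((len(events_2d), granularity), dtype=int)
    for i, row in enumerate(events_2d):
        idx = 0
        for j, v in enumerate(lin_vals):
            # The grid points increase, so each search can start where
            # the previous one ended (the `lo` argument of bisect_right).
            idx = bisect_right(row, v, idx)
            counts[i, j] = idx
    return counts
```

Each binary search then runs over an ever-shrinking suffix of the row, which helps most when the events are spread evenly across the grid.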
Upvotes: 0