Reputation: 2071
I am working with timeseries of rainfall volumes and want to compute the length and volume of each individual rainfall event, where an "event" is a sequence of consecutive non-zero timesteps. I have multiple timeseries of ~60k timesteps each, and my current approach is quite slow.
Currently I have the following:
import numpy as np

def count_events(timeseries):
    start = 0
    end = 0
    lengths = []
    volumes = []
    # pad a 0 at the edges so as to include edges as "events"
    for i, val in enumerate(np.pad(timeseries, pad_width=1, mode='constant')):
        if val > 0 and start == 0:
            start = i
        if val == 0 and start > 0:
            end = i
            if end - start != 1:
                volumes.append(np.sum(timeseries[start:end]))
            elif end - start == 1:
                volumes.append(timeseries[start-1])
            lengths.append(end - start)
            start = 0
    return np.asarray(lengths), np.asarray(volumes)
Expected output:
testrain = np.array([1,0,1,0,2,2,8,2,0,0,0.1,0,0,1])
lengths, volumes = count_events(testrain)
print lengths
[1 1 4 1 1]
print volumes
[ 1. 1. 12. 0.1 1. ] # 12 should actually be 14, my code returns wrong results.
I imagine there's a far better way to do this, leveraging numpy's efficiency, but nothing comes to mind...
EDIT:
Comparing the different solutions:
testrain = np.random.normal(10,5, 60000)
testrain[testrain<0] = 0
My solution (produces wrong results: the index i runs over the padded array, so timeseries[start:end] slices the unpadded array one step too late for multi-step events, which is why 14 came out as 12 above):
%timeit count_events(testrain)
#10 loops, best of 3: 129 ms per loop
@dawg's:
%timeit dawg(testrain) # using itertools
#10 loops, best of 3: 113 ms per loop
%timeit dawg2(testrain) # using pure numpy
#10 loops, best of 3: 156 ms per loop
@DSM's:
%timeit DSM(testrain)
#10 loops, best of 3: 28.4 ms per loop
@DanielLenz's:
%timeit DanielLenz(testrain)
#10 loops, best of 3: 316 ms per loop
Upvotes: 4
Views: 1125
Reputation: 353369
While you can do this in pure numpy, you're basically applying numpy to a pandas problem. Your volume is the result of a groupby operation, which you can fake in numpy but is native to pandas.
For example:
>>> import pandas as pd
>>> tr = pd.Series(testrain)
>>> nonzero = (tr != 0)
>>> group_ids = (nonzero & (nonzero != nonzero.shift())).cumsum()
>>> events = tr[nonzero].groupby(group_ids).agg([sum, len])
>>> events
    sum  len
1   1.0    1
2   1.0    1
3  14.0    4
4   0.1    1
5   1.0    1
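If you need plain numpy arrays downstream, the aggregated columns can be pulled back out; a small follow-up (my addition, relying on the sum/len column labels produced by the agg call above):
>>> lengths = events['len'].values
>>> volumes = events['sum'].values
>>> lengths
array([1, 1, 4, 1, 1])
>>> volumes
array([  1. ,   1. ,  14. ,   0.1,   1. ])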
Upvotes: 5
Reputation: 3867
Here's my approach, using labels
from scipy.ndimage.measurements
:
import numpy as np
from scipy.ndimage.measurements import label
testrain = np.array([1,0,1,0,2,2,8,2,0,0,0.1,0,0,1])
labels, nlabels = label(testrain)
labels
>> array([1, 0, 2, 0, 3, 3, 3, 3, 0, 0, 4, 0, 0, 5], dtype=int32)
def sum_and_length(n):
    obj = testrain[labels == n]
    return [np.sum(obj), obj.size]
sums, lengths = np.array(map(sum_and_length, range(1, nlabels+1))).T
sums
>> array([ 1. , 1. , 14. , 0.1, 1. ])
lengths
>> array([ 1., 1., 4., 1., 1.])
It's not the most beautiful approach, given that this problem is perfect for pandas, but it might make you look into measurements, which is a very powerful toolset.
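As a side note (my addition, not part of the original answer): measurements also provides vectorized reducers such as sum, which avoid the per-label Python round trip entirely. A minimal sketch:
import numpy as np
from scipy.ndimage.measurements import label, sum as ndi_sum

testrain = np.array([1, 0, 1, 0, 2, 2, 8, 2, 0, 0, 0.1, 0, 0, 1])
labels, nlabels = label(testrain)
index = range(1, nlabels + 1)

# volume of each event: sum of the values under each label
sums = ndi_sum(testrain, labels, index)
# length of each event: sum over a ones-array under each label
lengths = ndi_sum(np.ones_like(testrain), labels, index)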
Upvotes: 1
Reputation: 104034
Here is a groupby solution:
import numpy as np
from itertools import groupby
testrain = np.array([1,0,1,0,2,2,8,2,0,0,0.1,0,0,1])
lengths = []
volumes = []

for k, l in groupby(testrain, key=lambda v: v > 0):
    if k:
        li = list(l)
        lengths.append(len(li))
        volumes.append(sum(li))

print lengths
print volumes
Prints
[1, 1, 4, 1, 1]
[1.0, 1.0, 14.0, 0.10000000000000001, 1.0]
If you want something purely in numpy:
def find_runs(arr):
    # split on the zeros, then drop the zeros from each piece
    subs = np.split(arr, np.where(arr == 0.)[0])
    arrs = [np.delete(sub, np.where(sub == 0.)) for sub in subs]
    return [(len(e), sum(e)) for e in arrs if len(e)]
>>> find_runs(testrain)
[(1, 1.0), (1, 1.0), (4, 14.0), (1, 0.10000000000000001), (1, 1.0)]
>>> length, volume=zip(*find_runs(testrain))
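A fully vectorized variant (my addition, not from the original answer; it assumes the values are non-negative, so the zero gaps between events contribute nothing to the segment sums):
def find_runs_vectorized(arr):
    # mark the transitions between zero runs and non-zero runs
    mask = np.concatenate(([False], arr > 0, [False]))
    edges = np.flatnonzero(mask[1:] != mask[:-1])
    starts, ends = edges[::2], edges[1::2]
    if starts.size == 0:
        return np.array([], dtype=int), np.array([])
    # reduceat sums from each start to the next start; the trailing and
    # intervening zeros add nothing, so these are the event volumes
    return ends - starts, np.add.reduceat(arr, starts)

>>> find_runs_vectorized(testrain)
(array([1, 1, 4, 1, 1]), array([  1. ,   1. ,  14. ,   0.1,   1. ]))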
Upvotes: 4