Reputation: 123
I am running a loop a few million times, and I need to subset a different amount of data in each loop. I have a dataframe that has two columns, time (which is a time series) and electrode, which signifies a number between 1-64 for any electrode fired at that time.
time electrode
0 1
1 43
2 45
3 12
4 7
In each loop I need to subset the data, as such:
num_electrodes = []
window_size = 5
index = 0
while index < len(data['time']) - interval_size:
start = data['time'][index]
end = data['time'][index+window_size]
window_data = data[(data['time'] >= start) & (data['time'] < end)]
num_electrodes.append(len(window_data['electrode'].unique()))
The really slow part of the code here is subsetting the dataframe and making a new dataframe, in the following code.
window_data = data[(data['time'] >= start) & (data['time'] < end)]
Is there any good alternative to this?
Upvotes: 2
Views: 1759
Reputation: 123
So I solved this by switching to numpy.ndarray which just went infinitely faster than indexing with iloc.
Upvotes: 1
Reputation: 1508
Sort by your time, then you can use .loc to access the indices at the beginning and end of your window, and then select a range of indices as your subset.
Set your df's index to the time series, then use df.index.get_loc(beginning_window)
and min(df.index.get_loc(beginning_window+window+1)) -1
to get your index range.
The min accounts for non-unique indices.
Then use .iloc to select that range.
That should speed it up by quite a bit.
Upvotes: 1
Reputation: 18638
Assuming your data is sorted by time, you just have to group the electrodes by 5. Then set
can be faster than np.unique
:
size=10**6
window_size=5
electrodes = np.random.randint(0,64,size)
electrodes_by_5 = electrodes.reshape(-1,window_size)
nb_electrodes=np.apply_along_axis(lambda arr:len(set(arr)),1,electrodes_by_5)
Output :
In [463]: electrodes[:10]
Out[463]: array([13, 13, 23, 20, 5, 30, 9, 6, 28, 11])
In [464]: electrodes_by_5[:2]
Out[464]:
array([[13, 13, 23, 20, 5],
[30, 9, 6, 28, 11]])
In [465]: nb_electrodes[:2]
Out[465]: array([4, 5])
Upvotes: 0