Rotavator
Rotavator

Reputation: 123

Fast subsetting in Pandas in Python

I am running a loop a few million times, and I need to subset a different amount of data in each loop. I have a dataframe that has two columns, time (which is a time series) and electrode, which signifies a number between 1-64 for any electrode fired at that time.

time    electrode
 0          1
 1          43
 2          45
 3          12
 4          7

In each loop I need to subset the data, as such:

num_electrodes = []
window_size = 5
index = 0
while index < len(data['time']) - interval_size:
    start = data['time'][index]
    end = data['time'][index+window_size]
    window_data = data[(data['time'] >= start) & (data['time'] < end)]
    num_electrodes.append(len(window_data['electrode'].unique()))

The really slow part of the code here is subsetting the dataframe and making a new dataframe, in the following code.

window_data = data[(data['time'] >= start) & (data['time'] < end)]

Is there any good alternative to this?

Upvotes: 2

Views: 1759

Answers (3)

Rotavator
Rotavator

Reputation: 123

So I solved this by switching to numpy.ndarray which just went infinitely faster than indexing with iloc.

Upvotes: 1

John R
John R

Reputation: 1508

Sort by your time, then you can use .loc to access the indices at the beginning and end of your window, and then select a range of indices as your subset.

Set your df's index to the time series, then use df.index.get_loc(beginning_window) and min(df.index.get_loc(beginning_window+window+1)) -1 to get your index range.

The min accounts for non-unique indices.

Then use .iloc to select that range.

That should speed it up by quite a bit.

Upvotes: 1

B. M.
B. M.

Reputation: 18638

Assuming your data is sorted by time, you just have to group the electrodes by 5. Then set can be faster than np.unique :

size=10**6
window_size=5
electrodes = np.random.randint(0,64,size)
electrodes_by_5 = electrodes.reshape(-1,window_size)

nb_electrodes=np.apply_along_axis(lambda arr:len(set(arr)),1,electrodes_by_5)

Output :

In [463]: electrodes[:10]
Out[463]: array([13, 13, 23, 20,  5, 30,  9,  6, 28, 11])

In [464]: electrodes_by_5[:2]
Out[464]: 
array([[13, 13, 23, 20,  5],
       [30,  9,  6, 28, 11]])

In [465]: nb_electrodes[:2]
Out[465]: array([4, 5])

Upvotes: 0

Related Questions