Reputation: 1221
Is there any way to select multiple ranges in numpy arrays or pandas DataFrames efficiently, all in one go?
import pandas as pd
import numpy as np
from time import time
data = pd.DataFrame(np.random.sample((10000,10)))
%timeit -n 10000 result = pd.concat((data[100:150], data[200:300]))
10000 loops, best of 3: 1.47 ms per loop
In the above example, how can I select rows 100:150 and 200:300 without using concat? Is this even possible?
The above operation has a bottleneck in pd.concat; it can be sped up somewhat with np.vstack, but I'd still like to select the two ranges all at once without copying the underlying data, as concat does.
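For reference, the np.vstack variant mentioned above looks roughly like this (a sketch; it works on the underlying NumPy array but still copies the data, and returns a plain ndarray rather than a DataFrame):
result = np.vstack((data.values[100:150], data.values[200:300]))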
Time is crucial: I'd like to get as close as possible to the time you get when accessing a single contiguous range directly, as below:
%timeit -n 10000 result = data[100:150]
10000 loops, best of 3: 94 µs per loop
Upvotes: 3
Views: 1707
Reputation: 109726
I believe you first need to make a list of the target rows and then use iloc.
rows = [i for i in list(range(100, 150)) + list(range(200, 250))]
>>> data.iloc[rows, :]
0 1 2 3 4 5 6 7 8 9
100 0.936412 0.875215 0.626169 0.362366 0.086108 0.709103 0.748132 0.696450 0.814539 0.502694
101 0.011131 0.733182 0.127739 0.743762 0.954454 0.018809 0.119522 0.319173 0.546778 0.982340
102 0.412659 0.977685 0.981917 0.319247 0.626653 0.845410 0.828058 0.506033 0.283324 0.495679
.. ... ... ... ... ... ... ... ... ... ...
247 0.827967 0.803476 0.637800 0.603473 0.968779 0.976671 0.747728 0.029828 0.391113 0.381155
248 0.394331 0.120555 0.875771 0.529207 0.143756 0.334991 0.989489 0.584157 0.730615 0.187992
249 0.634841 0.624685 0.746429 0.374769 0.632195 0.922843 0.200508 0.024452 0.223971 0.457757
[100 rows x 10 columns]
%timeit rows = [i for i in list(range(100, 150)) + list(range(200, 300))]; data.iloc[rows, :]
1000 loops, best of 3: 283 µs per loop
%timeit pd.concat([data[100:150], data[200:300]])
1000 loops, best of 3: 927 µs per loop
Upvotes: 0
Reputation: 10328
There are a few approaches I can think of. We can try them and see which is fastest. However, you aren't going to be able to avoid a copy. There is no way to handle discontinuous ranges without a copy.
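A quick way to see why (a minimal NumPy sketch): a basic slice is a view onto the same buffer, whereas picking two disjoint ranges with fancy indexing has to allocate a new array.
import numpy as np
a = np.random.sample((10000, 10))
view = a[100:150]                                            # basic slice: a view, no copy
print(view.base is a)                                        # True
picked = a[list(range(100, 150)) + list(range(200, 300))]    # fancy indexing: a copy
print(picked.base is a)                                      # False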
concat
>>> %%timeit -n 10000 data = pd.DataFrame(np.random.sample((10000,10)))
... result = pd.concat((data[100:150], data[200:300]))
...
10000 loops, best of 3: 3.81 ms per loop
List of indexes
>>> %%timeit -n 10000 data = pd.DataFrame(np.random.sample((10000,10)))
... result = data.iloc[list(range(100, 150))+list(range(200, 300))]
...
10000 loops, best of 3: 479 µs per loop
Logical indexing
>>> %%timeit -n 10000 data = pd.DataFrame(np.random.sample((10000,10)))
... result = data[((100 <= data.index) & (data.index < 150)) |
... ((200 <= data.index) & (data.index < 300))]
...
10000 loops, best of 3: 580 µs per loop
Slice and drop
>>> %%timeit -n 10000 data = pd.DataFrame(np.random.sample((10000,10)))
... result = data[100:300].drop(np.arange(150, 200))
...
10000 loops, best of 3: 1.22 ms per loop
So it seems that providing a list of indexes and logical indexing are the fastest approaches, at roughly equivalent speed (I wouldn't put any weight on a speed difference that small).
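If building the Python lists feels clunky, numpy's np.r_ can assemble the same index array from slice notation (a small sketch of the same "list of indexes" approach, not separately timed here):
rows = np.r_[100:150, 200:300]   # concatenate the two ranges into one integer index array
result = data.iloc[rows]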
Upvotes: 3
Reputation: 76406
You can combine boolean conditions, and pass them to the subscript operator:
data[((100 <= data.index) & (data.index < 150)) | ((200 <= data.index) & (data.index < 300))]
(Note the parentheses, BTW - they are unseemly, but they are required, since & and | bind more tightly than the comparison operators.)
Upvotes: 1