Alessandro Mariani

Reputation: 1221

Python selecting multiple ranges with numpy\pandas

Is there any way to select multiple ranges in NumPy arrays or pandas DataFrames efficiently, all in one go?

import pandas as pd
import numpy as np
from time import time

data = pd.DataFrame(np.random.sample((10000,10)))

%timeit -n 10000 result = pd.concat((data[100:150], data[200:300]))
10000 loops, best of 3: 1.47 ms per loop

In the above example, how can I select rows 100:150 and 200:300 without using concat? Is this even possible?

The above operation has its bottleneck in pd.concat and can be sped up somewhat using np.vstack, but I'd still like to select the two ranges all at once, without copying the underlying data as concat does.
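For reference, a sketch of the np.vstack variant mentioned above: it bypasses pd.concat but still copies the data, and the result gets a fresh 0-based index rather than the original row labels.

values = data.values
result = pd.DataFrame(np.vstack((values[100:150], values[200:300])))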

Time is crucial, as I'd like to get as close as possible to the time you get when accessing a contiguous range directly, as below:

%timeit -n 10000  result = data[100:150]
10000 loops, best of 3: 94 µs per loop

Upvotes: 3

Views: 1707

Answers (3)

Alexander

Reputation: 109726

I believe you first need to make a list of the target rows and then use iloc.

>>> rows = list(range(100, 150)) + list(range(200, 250))
>>> data.iloc[rows, :]
            0         1         2         3         4         5         6         7         8         9
100  0.936412  0.875215  0.626169  0.362366  0.086108  0.709103  0.748132  0.696450  0.814539  0.502694
101  0.011131  0.733182  0.127739  0.743762  0.954454  0.018809  0.119522  0.319173  0.546778  0.982340
102  0.412659  0.977685  0.981917  0.319247  0.626653  0.845410  0.828058  0.506033  0.283324  0.495679
..        ...       ...       ...       ...       ...       ...       ...       ...       ...       ...
247  0.827967  0.803476  0.637800  0.603473  0.968779  0.976671  0.747728  0.029828  0.391113  0.381155
248  0.394331  0.120555  0.875771  0.529207  0.143756  0.334991  0.989489  0.584157  0.730615  0.187992
249  0.634841  0.624685  0.746429  0.374769  0.632195  0.922843  0.200508  0.024452  0.223971  0.457757

[100 rows x 10 columns]

%timeit rows = [i for i in list(range(100, 150)) + list(range(200, 300))]; data.iloc[rows, :]
1000 loops, best of 3: 283 µs per loop

%timeit pd.concat([data[100:150], data[200:300]])
1000 loops, best of 3: 927 µs per loop
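If a NumPy helper is acceptable, np.r_ builds the same combined index array from slice notation in a single expression, which can be passed straight to iloc. A sketch equivalent to the list-of-rows approach above; the selection itself still copies:

result = data.iloc[np.r_[100:150, 200:300]]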

Upvotes: 0

TheBlackCat

Reputation: 10328

There are a few approaches I can think of; we can try them and see which is fastest. However, you aren't going to be able to avoid a copy: an ndarray view must have a single, constant stride per axis, so there is no way to cover two discontinuous ranges with one view.
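A quick way to see this is np.shares_memory: a single slice is a view into the original buffer, while indexing with two ranges necessarily materializes a copy. A minimal sketch on a plain ndarray:

a = np.random.sample((10000, 10))
s = a[100:150]                    # plain slice: a view
c = a[np.r_[100:150, 200:300]]    # fancy indexing: a copy
print(np.shares_memory(a, s))     # True
print(np.shares_memory(a, c))     # False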

concat

>>> %%timeit -n 10000  data = pd.DataFrame(np.random.sample((10000,10)))
... result = pd.concat((data[100:150], data[200:300]))
...
10000 loops, best of 3: 3.81 ms per loop

List of indexes

>>> %%timeit -n 10000  data = pd.DataFrame(np.random.sample((10000,10)))
... result = data.iloc[list(range(100, 150))+list(range(200, 300))]
...
10000 loops, best of 3: 479 µs per loop

Logical indexing

>>> %%timeit -n 10000  data = pd.DataFrame(np.random.sample((10000,10)))
... result = data[((100 <= data.index) & (data.index < 150)) | 
...               ((200 <= data.index) & (data.index < 300))]
...
10000 loops, best of 3: 580 µs per loop

slice and drop

>>> %%timeit -n 10000  data = pd.DataFrame(np.random.sample((10000,10)))
... result = data[100:300].drop(np.arange(150, 200))
...
10000 loops, best of 3: 1.22 ms per loop

So it seems that the list-of-indexes and logical-indexing approaches are the fastest, at roughly equivalent speed (I wouldn't put any weight on a speed difference that small).

Upvotes: 3

Ami Tavory

Reputation: 76406

You can combine boolean conditions, and pass them to the subscript operator:

data[((100 <= data.index) & (data.index < 150)) | ((200 <= data.index) & (data.index < 300))]

(Note the parentheses, BTW - they are unseemly, but the order of precedence requires them.)
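The reason: in Python, & and | bind more tightly than comparisons, so without the parentheses the expression groups the wrong way. A minimal illustration:

# Unparenthesized, 100 <= data.index & data.index < 150 parses as
# 100 <= (data.index & data.index) < 150, a chained comparison that
# raises the usual "truth value is ambiguous" error for arrays.
mask = ((100 <= data.index) & (data.index < 150)) | ((200 <= data.index) & (data.index < 300))
result = data[mask]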

Upvotes: 1
