tahaum

Reputation: 55

Selecting a subset based on multiple slices in pandas/NumPy?

I want to select a subset of some pandas DataFrame columns based on several slices.

In [1]: import numpy as np
        import pandas as pd
        df = pd.DataFrame(data={'A': np.random.rand(100), 'B': np.random.rand(100), 'C': np.random.rand(100)})
        df.head()

Out[1]:            A           B           C
        0   0.745487    0.146733    0.594006
        1   0.212324    0.692727    0.244113
        2   0.954276    0.318949    0.199224
        3   0.606276    0.155027    0.247255
        4   0.155672    0.464012    0.229516

Something like:

In [2]: df.loc[[slice(1, 4), slice(42, 44)], ['B', 'C']]

Expected output:

Out[2]:            B           C
        1   0.692727    0.244113
        2   0.318949    0.199224
        3   0.155027    0.247255
        42  0.335285    0.000997
        43  0.019172    0.237810

I've seen that NumPy's r_ object can help when you want to use multiple slices, e.g.:

In [3]: arr = np.array([1, 2, 3, 4, 5, 5, 5, 5])
        arr[np.r_[1:3, 4:6]]
Out[3]: array([2, 3, 5, 5])

But I can't get this to work with a predefined collection (list) of slices. Ideally, I would like to specify a collection of ranges/slices and subset based on it. It doesn't seem like r_ accepts iterables? I've seen that one could, for example, build an array with hstack and then use it as an index:

In [4]: idx = np.hstack((np.arange(1, 4), np.arange(42, 44)))
        df.loc[idx, ['B', 'C']]
Out[4]:            B           C
        1   0.692727    0.244113
        2   0.318949    0.199224
        3   0.155027    0.247255
        42  0.335285    0.000997
        43  0.019172    0.237810

This gets me what I need, but is there a faster, cleaner, or otherwise preferred way of doing this?

Upvotes: 2

Views: 779

Answers (2)

Stefan

Reputation: 1934

A bit late, but it might also help others:

pd.concat([df.loc[sl, ['B', 'C']] for sl in [slice(1, 4), slice(42, 44)]])

This also works when you are dealing with other kinds of slices, e.g. time windows.
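One caveat worth checking: .loc slices by label and includes the stop label, so with the default integer index slice(1, 4) also returns row 4, which is one row more than the expected output in the question. A minimal sketch of the difference, assuming a default RangeIndex and using deterministic made-up data in place of the random frame (the names by_label and by_position are only for illustration):

```python
import numpy as np
import pandas as pd

# Illustrative data (not the question's random values) on a default RangeIndex
df = pd.DataFrame({
    'A': np.arange(100),
    'B': np.arange(100) * 10,
    'C': np.arange(100) * 100,
})

slices = [slice(1, 4), slice(42, 44)]

# .loc slices by label and INCLUDES the stop label
by_label = pd.concat([df.loc[sl, ['B', 'C']] for sl in slices])

# .iloc slices by position and EXCLUDES the stop, like a normal Python slice
by_position = pd.concat([df.iloc[sl][['B', 'C']] for sl in slices])

print(by_label.index.tolist())     # [1, 2, 3, 4, 42, 43, 44]
print(by_position.index.tolist())  # [1, 2, 3, 42, 43]
```

If the half-open behaviour from the question is what you want, the .iloc variant is probably the closer match; for label-based windows such as timestamps, the inclusive .loc behaviour is usually the intended one.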

Upvotes: 1

Aryerez

Reputation: 3495

You can do:

df.loc[list(range(1, 4)) + list(range(42, 44)), ['B', 'C']]

This took about a quarter of the time of your np.hstack option.
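As for r_ and a predefined list of slices: indexing np.r_ with a tuple appears to be equivalent to writing the slices out by hand, so converting the list to a tuple should be enough. A minimal sketch, assuming a default RangeIndex (so labels and positions coincide) and made-up deterministic data; I have not timed it against the other options:

```python
import numpy as np
import pandas as pd

# Illustrative data (not the question's random values) on a default RangeIndex
df = pd.DataFrame({
    'A': np.arange(100),
    'B': np.arange(100) * 10,
    'C': np.arange(100) * 100,
})

slices = [slice(1, 4), slice(42, 44)]

# np.r_[tuple(slices)] is the same as writing np.r_[1:4, 42:44]
idx = np.r_[tuple(slices)]

# The integer array is used as a list of labels here, as in the question's In [4]
subset = df.loc[idx, ['B', 'C']]

print(idx.tolist())           # [1, 2, 3, 42, 43]
print(subset.index.tolist())  # [1, 2, 3, 42, 43]
```

With a non-default index, df.iloc[idx] would be the positional equivalent.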

Upvotes: 0
