Reputation: 55
I want to select a subset of some pandas DataFrame columns based on several slices.
In [1]: import numpy as np
   ...: import pandas as pd
   ...: df = pd.DataFrame(data={'A': np.random.rand(100), 'B': np.random.rand(100), 'C': np.random.rand(100)})
   ...: df.head()
Out[1]:
          A         B         C
0  0.745487  0.146733  0.594006
1  0.212324  0.692727  0.244113
2  0.954276  0.318949  0.199224
3  0.606276  0.155027  0.247255
4  0.155672  0.464012  0.229516
Something like:
In [2]: df.loc[[slice(1, 4), slice(42, 44)], ['B', 'C']]
Expected output:
Out[2]:
           B         C
1   0.692727  0.244113
2   0.318949  0.199224
3   0.155027  0.247255
42  0.335285  0.000997
43  0.019172  0.237810
I've seen that NumPy's r_ object can help when you want to use multiple slices, e.g.:
In [3]: arr = np.array([1, 2, 3, 4, 5, 5, 5, 5])
arr[np.r_[1:3, 4:6]]
Out[3]: array([2, 3, 5, 5])
But I can't get this to work with a predefined collection (list) of slices. Ideally I would like to specify a collection of ranges/slices and subset based on that. It doesn't seem like r_ accepts iterables? (But see the note at the end of this question.) I've seen that one could, for example, build an index array with hstack and then use it, like:
In [4]: idx = np.hstack((np.arange(1, 4), np.arange(42, 44)))
df.loc[idx, ['B', 'C']]
Out[4]:
           B         C
1   0.692727  0.244113
2   0.318949  0.199224
3   0.155027  0.247255
42  0.335285  0.000997
43  0.019172  0.237810
This gets me what I need, but is there any faster/cleaner/preferred/whatever way of doing this?
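A note on the r_ point above: r_ does accept a predefined collection of slices if you pass it as a single tuple, because x[a, b] is just shorthand for x[(a, b)]. A minimal sketch:
In [5]: slices = [slice(1, 4), slice(42, 44)]
df.loc[np.r_[tuple(slices)], ['B', 'C']]
Integer slices in r_ exclude the stop value, just like regular Python slices, so this selects the same rows as the hstack approach.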
Upvotes: 2
Views: 779
Reputation: 1934
A bit late, but it might also help others:
pd.concat([df.loc[sl, ['B', 'C']] for sl in [slice(1, 4), slice(42, 44)]])
This also works when you are dealing with other kinds of slices, e.g. time windows.
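For instance, a minimal sketch with a made-up daily DatetimeIndex (the frame, column names, and dates are purely illustrative):
import numpy as np
import pandas as pd

# Toy frame indexed by 100 consecutive days
ts = pd.DataFrame({'B': np.random.rand(100), 'C': np.random.rand(100)},
                  index=pd.date_range('2021-01-01', periods=100, freq='D'))

# Two time windows expressed as slices over the index labels
windows = [slice('2021-01-05', '2021-01-08'), slice('2021-02-10', '2021-02-12')]
pd.concat([ts.loc[w, ['B', 'C']] for w in windows])
Label-based .loc slicing on a DatetimeIndex includes both endpoints, so each window covers its start and end dates inclusively.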
Upvotes: 1
Reputation: 3495
You can do:
df.loc[[x for x in range(1, 4)] + [x for x in range(42, 44)], ['B', 'C']]
This took about a quarter of the time of your np.hstack option.
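To reproduce the comparison yourself, a quick sketch using IPython's %timeit (actual timings will vary by machine and pandas version):
%timeit df.loc[[x for x in range(1, 4)] + [x for x in range(42, 44)], ['B', 'C']]
%timeit df.loc[np.hstack((np.arange(1, 4), np.arange(42, 44))), ['B', 'C']]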
Upvotes: 0