tomasz74

Reputation: 16671

Select multiple sections of rows by index in pandas

I have a large DataFrame with a GPS path and some attributes. I need to analyse only a few sections of the path, and I would like to subset those sections into a new DataFrame. I can subset one section at a time, but the idea is to get them all at once and keep the original index.

The problem is similar to:

import pandas as pd 
df = pd.DataFrame({'A': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
                   'B': ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']},
                  index=range(10, 20))

I want to get something like:

cdf = df.loc[[11:13] & [17:20]] # SyntaxError: invalid syntax

desired outcome:

    A  B
11  1  b
12  2  c
13  3  d
17  7  h
18  8  i
19  9  j

I know the example is easy with cdf = df.loc[[11,12,13,17,18,19],:], but in the original problem I have thousands of rows and some entries have already been removed, so listing individual labels is not really an option.

Upvotes: 5

Views: 8058

Answers (3)

sammywemmy

Reputation: 28699

One option is pyjanitor's select_rows. Note that the selection is based on labels, not integer positions:

# pip install pyjanitor
import pandas as pd
import janitor  # importing janitor registers select_rows on DataFrame

df.select_rows(slice(11, 13), slice(17, 20))
    A  B
11  1  b
12  2  c
13  3  d
17  7  h
18  8  i
19  9  j

Upvotes: 0

jezrael

Reputation: 862791

One possible solution with concat:

cdf = pd.concat([df.loc[11:13], df.loc[17:20]])
print(cdf)
    A  B
11  1  b
12  2  c
13  3  d
17  7  h
18  8  i
19  9  j
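
The concat approach extends to any number of sections, and because each piece is a label slice it should tolerate removed rows as long as the index is sorted. A minimal sketch of that idea; the `sections` list and the dropped labels are made up for illustration and are not part of the original question:

```python
import pandas as pd

df = pd.DataFrame({'A': list(range(10)), 'B': list('abcdefghij')},
                  index=range(10, 20))
# Simulate entries that were already removed from the data (assumed labels).
df = df.drop([12, 18])

# Hypothetical (start, end) label pairs; both ends are inclusive with .loc.
sections = [(11, 13), (17, 20)]

# Slice each section by label and stack the pieces, keeping the original index.
cdf = pd.concat([df.loc[start:end] for start, end in sections])
print(cdf)
```

Label 20 is not in the index and labels 12 and 18 were dropped, yet the slices simply return whatever rows exist inside each range.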

Another solution with range:

cdf = df.loc[list(range(11, 14)) + list(range(17, 20))]
print(cdf)
    A  B
11  1  b
12  2  c
13  3  d
17  7  h
18  8  i
19  9  j

Upvotes: 5

unutbu

Reputation: 879759

You could use np.r_ to concatenate the slices:

In [16]: df.loc[np.r_[11:13, 17:20]]
Out[16]: 
    A  B
11  1  b
12  2  c
17  7  h
18  8  i
19  9  j

Note, however, that df.loc[A:B] selects labels A through B with B included. np.r_[A:B] returns an array of A through B with B excluded. To include B you would need to use np.r_[A:B+1].
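
The endpoint difference is easy to check side by side. A small sketch using the question's example frame (the variable names are only for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': list(range(10)), 'B': list('abcdefghij')},
                  index=range(10, 20))

by_slice = df.loc[11:13]                # label slice: right endpoint 13 is included
by_array = df.loc[np.r_[11:13]]         # np.r_[11:13] is array([11, 12]): 13 is excluded
by_array_fixed = df.loc[np.r_[11:14]]   # extend the right endpoint by one

print(list(by_slice.index))
print(list(by_array.index))
print(list(by_array_fixed.index))
```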

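When passed a slice, such as df.loc[A:B], df.loc skips labels that are not in df.index (for a sorted index). In contrast, when passed an array, such as df.loc[np.r_[A:B]], df.loc expects every value to be present: older pandas versions added a new row filled with NaNs for each value in the array which is not in df.index, while pandas 1.0 and later raise a KeyError instead.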

Thus to produce the desired result, you would need to adjust the right endpoint of the slices and use isin to test for membership in df.index:

In [26]: df.loc[df.index.isin(np.r_[11:14, 17:21])]
Out[26]: 
    A  B
11  1  b
12  2  c
13  3  d
17  7  h
18  8  i
19  9  j
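
For many sections on an index with gaps, the same isin idea can be wrapped in a small function. This is only a sketch: `select_label_ranges` is a hypothetical helper, not a pandas function, and it assumes integer index labels and inclusive (start, end) pairs:

```python
import numpy as np
import pandas as pd

def select_label_ranges(df, ranges):
    """Return rows whose index label falls in any inclusive (start, end) range.

    Hypothetical helper for illustration; assumes integer index labels.
    """
    # np.arange excludes its stop value, so add 1 to keep `end` inclusive.
    wanted = np.concatenate([np.arange(start, end + 1) for start, end in ranges])
    # isin builds a boolean mask, so labels missing from the index are skipped.
    return df.loc[df.index.isin(wanted)]

df = pd.DataFrame({'A': list(range(10)), 'B': list('abcdefghij')},
                  index=range(10, 20))
df = df.drop([12, 18])  # simulate entries that were already removed

cdf = select_label_ranges(df, [(11, 13), (17, 20)])
print(cdf)
```

Because the selection is a boolean mask over df.index, the result keeps the original labels and row order, and it does not depend on the index being sorted.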

Upvotes: 11
