Confusion with pandas Series slicing

Question

I have a pandas DataFrame where one column is the trip distance a taxi has covered. I'm using value_counts() on this column to see the most common trip distances.

b = df['trip_distance'].value_counts()

Object b is a pandas Series. For the sake of completeness, the first 5 rows of this Series are

1.00     21815
0.90     18915
0.80     18449
1.10     18263
1.20     17823

This means that the most common trip distance is 1, which appears 21815 times, and likewise for the other rows.

However, if I type b[0:4], instead of printing the first 4 elements of this Series, it finds the element that corresponds to trip distance 0 and prints all the trip distances until it reaches trip distance 4. Of course, if trip distance 4 comes before trip distance 0, it returns an empty Series.

Nevertheless, when I try it on a custom Series

a = pd.Series([3, 1, 2, 3, 4, 4, 5]).value_counts()

Printing a gives

and when I try to slice this Series, that is, when I type a[0:3], I get the expected

4    2
3    2
5    1

Does anyone know why this is happening? I know this can be done with iloc/loc, I'm just curious about why slicing works on one Series but not the other.

Thanks in advance.

Cameron Riddell · Accepted Answer

When indexing values from a Series (or rows from a DataFrame), I will always recommend that you use the .loc and .iloc indexing accessors. By using these accessors you are explicitly telling pandas either, with loc, "this slice is based on the labels in the index", or, with iloc, "this slice is based on the integer positions of the elements". The tricky part comes when you don't use loc/iloc, as in your case, AND have a numeric index. When you use neither, pandas tries to infer whether you're referring to index labels or to positions. With an integer index, if you slice with a range of integers, pandas assumes that you are asking for positions and ignores the index labels.

import pandas as pd

data = pd.Series([5,6,7,8,9], index=range(10, 15))
print(data)

10    5
11    6
12    7
13    8
14    9
dtype: int64

Using .loc to get the values whose index labels fall in the slice 11 to 13:

# Slice based on the index values 11 to 13
data.loc[11:13]
11    6
12    7
13    8
dtype: int64

However, if we want the values based on their position in the Series, we use iloc. You will also note that iloc produces slices that are not inclusive of the final position (e.g. we only return the elements at positions 1 and 2, and omit position 3 in the example below), whereas in the example above using loc, we returned the elements corresponding to 11, 12, and 13 in the index.

data.iloc[1:3]
11    6
12    7
dtype: int64

Now that that has been said, I hope you understand why it's extremely unclear what this means:

data[11:13]

Are we asking pandas to find where the labels 11 to 13 exist in the index and give us that slice? Or are we asking for the 12th and 13th elements of this Series? In this case, pandas used the latter (see below). However, I would encourage you to always slice into a Series or DataFrame using either .loc or .iloc to avoid this ambiguity.

data[11:13]
Series([], dtype: int64)
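To make the two readings of data[11:13] concrete, here is a minimal sketch that writes each one out explicitly with .loc and .iloc. The variable names by_label and by_position are only illustrative. The positional reading comes back empty because the Series has just 5 elements, so there is nothing at positions 11 and 12.

```python
import pandas as pd

data = pd.Series([5, 6, 7, 8, 9], index=range(10, 15))

# Reading 1: labels 11 through 13 (label slices include the end label).
by_label = data.loc[11:13]

# Reading 2: positions 11 and 12 (positional slices exclude the end).
# The Series only has positions 0..4, so this slice is empty.
by_position = data.iloc[11:13]

print(by_label)
print(len(by_position))
```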

And that's just for slicing on an integer-based index. Your problem comes from how pandas implements a floating-point index (here's the real mind twister):

data.index = data.index.astype("float")
print(data)
10.0    5
11.0    6
12.0    7
13.0    8
14.0    9
dtype: int64

Now, all of a sudden, you can do this and it returns the values as if you had used .loc:

data[11:13]
11.0    6
12.0    7
13.0    8
dtype: int64

So what gives? Essentially, decisions had to be made. There needed to be some type of default behavior for slicing into a Series, and it unfortunately depends on the dtype of the index, which makes it feel inconsistent across index data types. Thankfully, you can avoid all of this confusion by using loc and iloc.
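Applied to the situation in the question, a rough sketch might look like the following. The trips Series is a made-up stand-in for df['trip_distance'] (the values are invented for illustration), and top4 and in_range are hypothetical names. Sorting the index before the label slice is my own addition so that "from 0 to 4" is well defined no matter how value_counts() ordered the rows.

```python
import pandas as pd

# Hypothetical stand-in for df['trip_distance']; values are invented.
trips = pd.Series([1.0, 1.0, 1.0, 0.9, 0.9, 0.0, 4.0,
                   2.5, 2.5, 2.5, 2.5, 5.5])
b = trips.value_counts()

# "The first 4 rows": positional, independent of the float labels.
top4 = b.iloc[:4]

# "Distances from 0 to 4": label-based, end label included.
# Sorting the index first puts the labels in ascending order.
in_range = b.sort_index().loc[0.0:4.0]

print(top4)
print(in_range)
```

Either way, the accessor states which of the two meanings you want, so the result no longer depends on whether the index happens to hold integers or floats.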
