Reputation: 305
I have a pandas DataFrame where one column is the trip distance a taxi has covered. I'm using
value_counts()
on this column in order to see the most common trip distances.
b = df['trip_distance'].value_counts()
Object b is a pandas Series. For the sake of completeness, the first 5 rows of this Series are
1.00 21815
0.90 18915
0.80 18449
1.10 18263
1.20 17823
This means that the most common trip distance is 1, which appears 21815 times, and likewise for the rest.
However, if I type b[0:4]
instead of printing the first 4 elements of this Series, it finds the element that corresponds to 0
trip distance and prints all the trip distances from there until it reaches trip distance 4. Of course, if trip distance 4 comes before trip distance 0, it returns an empty Series.
Nevertheless, when I try it on a custom Series
a = pd.Series([3, 1, 2, 3, 4, 4, 5]).value_counts()
Printing a
gives
4 2
3 2
5 1
2 1
1 1
and when I try to slice this Series, that is, when I type a[0:3]
I get the expected
4 2
3 2
5 1
Does anyone know why this is happening? I know this can be done with iloc/loc; I'm just curious about why slicing works on one Series but not the other.
Thanks in advance.
Upvotes: 2
Views: 1463
Reputation: 13407
When indexing values from a Series (or rows from a DataFrame), I always recommend using the .loc
and .iloc
indexing accessors. By using these accessors you explicitly tell pandas what you mean. With loc
: "This slice is based on the labels in the index". With iloc
: "This slice is based on the integer positions of the elements, ignoring the index labels". The tricky part comes when you use neither loc nor iloc, as in your case, AND have a numeric index. Then pandas has to infer whether you are referring to the index labels or to the positions. Essentially, if you slice with a range of integers on an integer index, pandas assumes that you mean the positions of the values and ignores the index.
import pandas as pd
data = pd.Series([5,6,7,8,9], index=range(10, 15))
print(data)
10 5
11 6
12 7
13 8
14 9
dtype: int64
Using .loc
to get the values that correspond to the slice 11 to 13 from the index:
# Slice based on the index values 11 to 13
data.loc[11:13]
11 6
12 7
13 8
dtype: int64
However if we want the values based on their position in the Series, we use iloc
. You will also note that iloc
produces slices that exclude the end position (e.g. we return only positions 1 and 2 and omit 3 in the example below), whereas the loc example above returned the elements labelled 11, 12, and 13 in the index.
data.iloc[1:3]
11 6
12 7
dtype: int64
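The inclusive/exclusive difference is easier to see when the labels are not numbers. Here is a minimal sketch; the Series s and its string labels are made up for illustration and are not part of the question's data:

```python
import pandas as pd

# Hypothetical Series with string labels, so labels and positions cannot be confused
s = pd.Series([5, 6, 7, 8], index=["a", "b", "c", "d"])

by_label = s.loc["b":"c"]    # label-based: both endpoints "b" and "c" are included
by_position = s.iloc[1:3]    # position-based: positions 1 and 2, position 3 is excluded

print(by_label)
print(by_position)
print(by_label.equals(by_position))
```

Both slices select the rows labelled "b" and "c", even though the loc slice names its last label while the iloc slice stops one position before its end.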
Now that that has been said, I hope you understand why it's extremely unclear what this means:
data[11:13]
Are we asking pandas to find the labels 11 to 13 in the index and give us that slice? Or are we asking for the 12th and 13th elements of this Series? In this case, pandas chose the latter (see below). However, I would encourage you to always slice into a Series or DataFrame using either .loc
or .iloc
to avoid this ambiguity.
data[11:13]
Series([], dtype: int64)
And that's just for slicing on an integer based index. Your problem comes from how pandas implements a floating type index (here's the real mind twister):
data.index = data.index.astype("float")
print(data)
10.0 5
11.0 6
12.0 7
13.0 8
14.0 9
dtype: int64
Now, all of a sudden, you can do this and it returns the values as if you had used .loc
:
data[11:13]
11.0 6
12.0 7
13.0 8
dtype: int64
So what gives? Essentially, decisions had to be made. There needed to be some default behavior for slicing into a Series, and it unfortunately depends on the index dtype, which makes it feel inconsistent across index data types. Thankfully, you can avoid all of this confusion by using loc
and iloc
.
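Applied to the situation in the question, a sketch could look like the following. The trip_distance values here are invented stand-ins for df['trip_distance'], and only loc/iloc are used, since the behavior of plain [] slicing depends on the index dtype (and may differ between pandas versions):

```python
import pandas as pd

# Hypothetical stand-in for df['trip_distance'] (float values, so value_counts()
# produces a float index, as in the question)
trip_distance = pd.Series([1.0, 1.0, 1.0, 0.9, 0.9, 1.1, 4.0, 0.0])
b = trip_distance.value_counts()

# The most common distances: slice by position
top_two = b.iloc[0:2]

# All distances between two labels: sort the index, then slice by label
# (both endpoints are included)
short_trips = b.sort_index().loc[0.0:1.0]

print(top_two)
print(short_trips)
```

b.head(2) would return the same rows as b.iloc[0:2]. Sorting the index before the loc slice makes the label range well defined regardless of the order value_counts() produced.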
Upvotes: 3