Sashank

Reputation: 600

Inconsistent slicing [:] behavior on Pandas Dataframes

I have two DataFrames. The first has an integer index; the second has a datetime index. The slice operator (:) behaves differently on them.

Case 1

>>> import pandas as pd
>>> import datetime as dt
>>> df = pd.DataFrame({'A':[1,2,3]}, index=[0,1,2])
>>> df
   A
0  1
1  2
2  3
>>> df[0:2]
   A
0  1
1  2

Case 2

>>> a = dt.datetime(2000,1,1)
>>> b = dt.datetime(2000,1,2)
>>> c = dt.datetime(2000,1,3)
>>> df = pd.DataFrame({'A':[1,2,3]}, index = [a,b,c])
>>> df
            A
2000-01-01  1
2000-01-02  2
2000-01-03  3
>>> df[a:b]
            A
2000-01-01  1
2000-01-02  2

Why does the final row get excluded in case 1 but not in case 2?

Upvotes: 1

Views: 189

Answers (2)

jezrael

Reputation: 862611

Don't use it; it is better to use loc for consistency:

import datetime
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3]}, index=[0, 1, 2])

print(df.loc[0:2])
   A
0  1
1  2
2  3

a = datetime.datetime(2000, 1, 1)
b = datetime.datetime(2000, 1, 2)
c = datetime.datetime(2000, 1, 3)
df = pd.DataFrame({'A': [1, 2, 3]}, index=[a, b, c])

print(df.loc[a:b])
            A
2000-01-01  1
2000-01-02  2

The reason the last row is omitted can be found in the docs:

With DataFrame, slicing inside of [] slices the rows. This is provided largely as a convenience since it is such a common operation.

print(df[0:2])  # df here is the integer-indexed frame from the first example
   A
0  1
1  2

For selecting by datetimes, exact indexing is used:

... In contrast, indexing with Timestamp or datetime objects is exact, because the objects have exact meaning. These also follow the semantics of including both endpoints.
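As a minimal sketch of the difference described above (the variable names by_position and by_label are only illustrative), iloc makes the positional, stop-excluded slice explicit, while loc makes the label-based, endpoint-inclusive slice explicit:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3]}, index=[0, 1, 2])

# Positional slice: half-open, the stop position is excluded (like a Python list).
by_position = df.iloc[0:2]

# Label slice: both endpoint labels are included.
by_label = df.loc[0:2]

print(by_position['A'].tolist())  # [1, 2]
print(by_label['A'].tolist())     # [1, 2, 3]
```

Choosing one of the two accessors explicitly avoids having to remember which rule plain [] applies for a given kind of slice.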

Upvotes: 5

anand_v.singh

Reputation: 2838

To understand this, let's first run an experiment:

import pandas as pd
import datetime as dt
a = dt.datetime(2000,1,1)
b = dt.datetime(2000,1,2)
c = dt.datetime(2000,1,3)
df = pd.DataFrame({'A':[4,5,6]}, index=[a,b,c])

Now let's use

df[0:2]

which gives us

            A
2000-01-01  4
2000-01-02  5

This behavior is consistent with Python list slicing: the stop position is excluded. But if you use df[a:c], you get

            A
2000-01-01  4
2000-01-02  5
2000-01-03  6

This is because df[a:c] is not an integer slice: its bounds are index labels, so pandas uses its own label-based slicing, which also includes the last element. When the slice bounds are integers, as in df[0:2], pandas falls back to ordinary positional slicing, which excludes the stop. As already mentioned in the answer by jezrael, it is better to use loc, as it is more consistent across the board.
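The experiment can be checked end to end with a short sketch like the one below. It assumes a reasonably recent pandas version; the names positional and labelled are only illustrative:

```python
import datetime as dt
import pandas as pd

a = dt.datetime(2000, 1, 1)
b = dt.datetime(2000, 1, 2)
c = dt.datetime(2000, 1, 3)
df = pd.DataFrame({'A': [4, 5, 6]}, index=[a, b, c])

# Integer bounds: plain [] slices by position, stop excluded.
positional = df[0:2]

# Datetime bounds: plain [] slices by label, both endpoints included.
labelled = df[a:c]

print(positional['A'].tolist())  # [4, 5]
print(labelled['A'].tolist())    # [4, 5, 6]

# The explicit accessors give the same results without the ambiguity.
print(positional.equals(df.iloc[0:2]))  # True
print(labelled.equals(df.loc[a:c]))     # True
```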

Upvotes: 1
