David Andreoletti
David Andreoletti

Reputation: 4861

Why is Pandas not returing a scalar/string instead of a Series when accessing a DataFrame with both a column and an index value?

Assuming the following 3 Pandas Dataframes:

df1:

   ask.c    ask.h    ask.l    ask.o    bid.c    bid.h    bid.l    bid.o complete    mid.c    mid.h    mid.l    mid.o  volume     date 

0  1.39146  1.39148  1.39146  1.39146  1.39138  1.39140  1.39136  1.39136     True  1.39142  1.39144  1.39141  1.39141      5    2014-03-19 09:00:00
.  .  
.  . 
.  .

df2:

   ask.c    ask.h    ask.l    ask.o    bid.c    bid.h    bid.l    bid.o complete    mid.c    mid.h    mid.l    mid.o  volume   date

0  1.39147  1.39148  1.39147  1.39147  1.39138  1.39138  1.39137  1.39137     True  1.39142  1.39142  1.39142  1.39142    4    2014-03-19 09:00:05
1  1.39149  1.39149  1.39148  1.39149  1.39138  1.39141  1.39138  1.39141     True  1.39143  1.39145  1.39143  1.39145    3    2014-03-19 09:00:10
.  .  
.  . 
.  .

df is built with:

df = pandas.concat([df1,df2], ignore_index=True)
df.drop_duplicates("date", keep="first", inplace=True)
df["date"] = pd.to_datetime(df["date"])
df = df.set_index("date")

Why is df["ask.c"]["2014-03-19 09:00:00"] returning a Panda Series instead of a scalar/string value ?

print(type(df1["ask.c"]["2014-03-19 09:00:00"]))¬                                                                              
<class 'pandas.core.series.Series'>

print(df1["ask.c"]["2014-03-19 09:00:00"])¬
Date
2014-03-19 09:00:00    1.39146
Name: ask.c, dtype: object

EDIT:

@IIya V.Schurov's explanation got me on the right track but we are not quite fully there yet. Pandas has something called "DatetimeIndex Partial String Indexing" which is a form slicing.

Hence the expression df["2016-12-07"] would return a series rather than a scalar. The documentation mentions that df["2016-12-07 09:00:00"] isnot a slice nor resolve to one and will raise a KeyError but in my case, it does not raise any KeyError and does resolve like a slice.

Upvotes: 1

Views: 1211

Answers (2)

Ilya V. Schurov
Ilya V. Schurov

Reputation: 8047

Adding a new answer to keep previous one for historical purposes.

Minimal example

import pandas as pd
series = pd.Series([1, 2, 3], pd.DatetimeIndex(['2016-12-07 09:00:00',
                                                '2016-12-07 09:00:05',
                                                '2016-12-08 09:00:10']))
print(type(series["2016-12-07 09:00:00"]))
# <class 'pandas.core.series.Series'>

The documentation

As the docs discuss, one can use string on DateTimeIndex's __getitem__/[] operator to get so-called Partial String Indexing. It works like this:

series['2016-12-07']
# give me all records for December, 07 2016

2016-12-07 09:00:00    1
2016-12-07 09:00:05    2
dtype: int64

One can explicitly provide a timestamp to get scalar instead of Series:

print(type(series[pd.to_datetime("2016-12-07 09:00:00")]))
# <class 'numpy.int64'>

Resolutions

String used as subscript can be detected as coerceable to slice or not. In the former case, an output of [] is Series. Otherwise, it is scalar (or KeyError). Moreover, if string is detected as coercable to slice, it can be used to index DataFrame as a slice (i.e. row-wise, not column-wise):

df = pd.DataFrame(series)
print(df["2016-12-07"])

                     0
2016-12-07 09:00:00  1
2016-12-07 09:00:05  2

When string can be considered as a slice? It depends on the resolution of this string and DateTimeIndex instance's resolution. Examples:

series = pd.Series([1, 2, 3], pd.DatetimeIndex(['2016-12-06 23:59:00'
                                                '2016-12-07 01:00:00',
                                                '2016-12-07 01:01:00',
                                                '2016-12-07 01:02:00']))
print(series.index.resolution)
# minute
print(type(series["2016-12-06"]))
# <class 'pandas.core.series.Series'>
print(type(series["2016-12-07"]))
# <class 'pandas.core.series.Series'>
print(type(series["2016-12-07 01"]))
# <class 'pandas.core.series.Series'>
print(type(series["2016-12-07 01:01"]))
# <class 'numpy.int64'>

So generally string is a slice if its resolution is less than the resolution of DateTimeIndex instance. However, there is an exception for second resolution.

series = pd.Series([1, 2, 3], pd.DatetimeIndex(['2016-12-06 23:59:00'
                                                '2016-12-07 01:00:00',
                                                '2016-12-07 01:01:00',
                                                '2016-12-07 01:02:01']))
print(series.index.resolution)
# second
# this is due to 1 second at the last timestamp
print(type(series["2016-12-07 01:01:00"]))
# <class 'pandas.core.series.Series'>

I can't answer, why this decision was made (see the discussion here for the references in the code), but it definitely the OP's case (as df2['date']'s resolution is second). So every timestamp with second resolution will be resolved to slice.


UPD. Finally the behavior of slicing for second resolution is fixed (in currrent master, pending to 0.20 release).

Upvotes: 2

Ilya V. Schurov
Ilya V. Schurov

Reputation: 8047

This answer is kept for historical purposes only. Please, refer to the other answer.


Okay, I got it. First of all, here is a simple reproduction:

import pandas as pd
series = pd.Series([1, 2, 3], pd.DatetimeIndex(['2016-12-07 09:00:00',
                                                '2016-12-07 09:00:05',
                                                '2016-12-08 09:00:10']))
print(type(series["2016-12-07 09:00:00"]))
# <class 'pandas.core.series.Series'>

My first idea was to replace the string with an actual Timestamp instance.

print(type(series[pd.to_datetime("2016-12-07 09:00:00")]))
# <class 'numpy.int64'>

Woah, it works! Then I tried to figure out, why pandas behave this way, and I runned the code above in the debugger. My first suggestion was it treats strings as sequences and therefore produces new Series instead of scalar value, but in fact it is more interesting. This behavior is due to fact that you indeed can pick a series of values with single string, like this:

series['2016-12-07']

2016-12-07 09:00:00    1
2016-12-07 09:00:05    2
dtype: int64

So pandas can understand that a string can represent date-time with some precision and if DateTimeIndex has several records that agrees with the string (up to precision given by the string), it returns all of them.

As it cannot understand, do you really want to get an interval or just one record from the datetime-like string, it returns an interval. Thus Series instead of scalar.

Morale: date-time is tricky. Explicit is better then implicit. Use explicit conversion to Timestamp and the Force will be with you.


EDIT. (Re: EDIT in the question.)

It seems that the behaviour depends on periodicity of index. If I make index periodic, I will get the following output:

series = pd.Series([1, 2, 3], pd.DatetimeIndex(['2016-12-07 09:00:00',
                                                '2016-12-08 09:00:00',
                                                '2016-12-09 09:00:00']))
print(type(series["2016-12-07 09:00:00"]))

# <class 'numpy.int64'>

Even more interesting:

df = pd.DataFrame(series)
df['2016-12-07 09:00:00']

KeyError: '2016-12-07 09:00:00'

I believe that this is the thing described in the docs:

Warning The following selection will raise a KeyError; otherwise this selection methodology would be inconsistent with other selection methods in pandas (as this is not a slice, nor does it resolve to one)

However, it is seem to me a bit strange that behaviour of Series is not the same as in DataFrame and that it depends so heavily on periodicity of index.

Upvotes: 0

Related Questions