Reputation: 4861
Assuming the following 3 Pandas Dataframes:
df1:
ask.c ask.h ask.l ask.o bid.c bid.h bid.l bid.o complete mid.c mid.h mid.l mid.o volume date
0 1.39146 1.39148 1.39146 1.39146 1.39138 1.39140 1.39136 1.39136 True 1.39142 1.39144 1.39141 1.39141 5 2014-03-19 09:00:00
. .
. .
. .
df2:
ask.c ask.h ask.l ask.o bid.c bid.h bid.l bid.o complete mid.c mid.h mid.l mid.o volume date
0 1.39147 1.39148 1.39147 1.39147 1.39138 1.39138 1.39137 1.39137 True 1.39142 1.39142 1.39142 1.39142 4 2014-03-19 09:00:05
1 1.39149 1.39149 1.39148 1.39149 1.39138 1.39141 1.39138 1.39141 True 1.39143 1.39145 1.39143 1.39145 3 2014-03-19 09:00:10
. .
. .
. .
df is built with:
df = pandas.concat([df1,df2], ignore_index=True)
df.drop_duplicates("date", keep="first", inplace=True)
df["date"] = pd.to_datetime(df["date"])
df = df.set_index("date")
Why is df["ask.c"]["2014-03-19 09:00:00"]
returning a Panda Series instead of a scalar/string value ?
print(type(df1["ask.c"]["2014-03-19 09:00:00"]))¬
<class 'pandas.core.series.Series'>
print(df1["ask.c"]["2014-03-19 09:00:00"])¬
Date
2014-03-19 09:00:00 1.39146
Name: ask.c, dtype: object
EDIT:
@IIya V.Schurov's explanation got me on the right track but we are not quite fully there yet. Pandas has something called "DatetimeIndex Partial String Indexing" which is a form slicing.
Hence the expression df["2016-12-07"]
would return a series rather than a scalar. The documentation mentions that df["2016-12-07 09:00:00"] is
not a slice nor resolve to one and will raise a KeyError
but in my case, it does not raise any KeyError and does resolve like a slice.
Upvotes: 1
Views: 1211
Reputation: 8047
Adding a new answer to keep previous one for historical purposes.
import pandas as pd
series = pd.Series([1, 2, 3], pd.DatetimeIndex(['2016-12-07 09:00:00',
'2016-12-07 09:00:05',
'2016-12-08 09:00:10']))
print(type(series["2016-12-07 09:00:00"]))
# <class 'pandas.core.series.Series'>
As the docs discuss, one can use string on DateTimeIndex
's __getitem__/[]
operator to get so-called Partial String Indexing. It works like this:
series['2016-12-07']
# give me all records for December, 07 2016
2016-12-07 09:00:00 1
2016-12-07 09:00:05 2
dtype: int64
One can explicitly provide a timestamp to get scalar instead of Series
:
print(type(series[pd.to_datetime("2016-12-07 09:00:00")]))
# <class 'numpy.int64'>
String used as subscript can be detected as coerceable to slice or not. In the former case, an output of []
is Series
. Otherwise, it is scalar (or KeyError
). Moreover, if string is detected as coercable to slice, it can be used to index DataFrame
as a slice (i.e. row-wise, not column-wise):
df = pd.DataFrame(series)
print(df["2016-12-07"])
0
2016-12-07 09:00:00 1
2016-12-07 09:00:05 2
When string can be considered as a slice? It depends on the resolution of this string and DateTimeIndex
instance's resolution. Examples:
series = pd.Series([1, 2, 3], pd.DatetimeIndex(['2016-12-06 23:59:00'
'2016-12-07 01:00:00',
'2016-12-07 01:01:00',
'2016-12-07 01:02:00']))
print(series.index.resolution)
# minute
print(type(series["2016-12-06"]))
# <class 'pandas.core.series.Series'>
print(type(series["2016-12-07"]))
# <class 'pandas.core.series.Series'>
print(type(series["2016-12-07 01"]))
# <class 'pandas.core.series.Series'>
print(type(series["2016-12-07 01:01"]))
# <class 'numpy.int64'>
So generally string is a slice if its resolution is less than the resolution of DateTimeIndex
instance. However, there is an exception for second
resolution.
series = pd.Series([1, 2, 3], pd.DatetimeIndex(['2016-12-06 23:59:00'
'2016-12-07 01:00:00',
'2016-12-07 01:01:00',
'2016-12-07 01:02:01']))
print(series.index.resolution)
# second
# this is due to 1 second at the last timestamp
print(type(series["2016-12-07 01:01:00"]))
# <class 'pandas.core.series.Series'>
I can't answer, why this decision was made (see the discussion here for the references in the code), but it definitely the OP's case (as df2['date']
's resolution is second
). So every timestamp with second
resolution will be resolved to slice.
UPD. Finally the behavior of slicing for second
resolution is fixed (in currrent master, pending to 0.20 release).
Upvotes: 2
Reputation: 8047
This answer is kept for historical purposes only. Please, refer to the other answer.
Okay, I got it. First of all, here is a simple reproduction:
import pandas as pd
series = pd.Series([1, 2, 3], pd.DatetimeIndex(['2016-12-07 09:00:00',
'2016-12-07 09:00:05',
'2016-12-08 09:00:10']))
print(type(series["2016-12-07 09:00:00"]))
# <class 'pandas.core.series.Series'>
My first idea was to replace the string with an actual Timestamp
instance.
print(type(series[pd.to_datetime("2016-12-07 09:00:00")]))
# <class 'numpy.int64'>
Woah, it works! Then I tried to figure out, why pandas behave this way, and I runned the code above in the debugger. My first suggestion was it treats strings as sequences and therefore produces new Series
instead of scalar value, but in fact it is more interesting. This behavior is due to fact that you indeed can pick a series of values with single string, like this:
series['2016-12-07']
2016-12-07 09:00:00 1
2016-12-07 09:00:05 2
dtype: int64
So pandas can understand that a string can represent date-time with some precision and if DateTimeIndex
has several records that agrees with the string (up to precision given by the string), it returns all of them.
As it cannot understand, do you really want to get an interval or just one record from the datetime-like string, it returns an interval. Thus Series
instead of scalar.
Morale: date-time is tricky. Explicit is better then implicit. Use explicit conversion to Timestamp
and the Force will be with you.
EDIT. (Re: EDIT in the question.)
It seems that the behaviour depends on periodicity of index. If I make index periodic, I will get the following output:
series = pd.Series([1, 2, 3], pd.DatetimeIndex(['2016-12-07 09:00:00',
'2016-12-08 09:00:00',
'2016-12-09 09:00:00']))
print(type(series["2016-12-07 09:00:00"]))
# <class 'numpy.int64'>
Even more interesting:
df = pd.DataFrame(series)
df['2016-12-07 09:00:00']
KeyError: '2016-12-07 09:00:00'
I believe that this is the thing described in the docs:
Warning The following selection will raise a KeyError; otherwise this selection methodology would be inconsistent with other selection methods in pandas (as this is not a slice, nor does it resolve to one)
However, it is seem to me a bit strange that behaviour of Series
is not the same as in DataFrame
and that it depends so heavily on periodicity of index.
Upvotes: 0