Philip O'Brien
Philip O'Brien

Reputation: 4266

Pandas DatetimeIndex from MongoDB ISODate

I am having some difficulty working with times/timezones. I have raw JSON data of the form

{
  "Date": "28 Sep 2009 00:00:00",
  ....
}

This data is then loaded into MongoDB, and this string representation of the date is transformed into a JavaScript Date object. This conversion to UTC time results in the following date

{
  "_id": ObjectId("577a788f4439e17afd4e21f7"),
  "Date": ISODate("2009-09-27T23:00:00Z")
}

It "looks" as though the date has actually been moved forward a day, I'm assuming (perhaps incorrectly) that this is because my machine is set to Irish Standard Time.

I then read this data from MongoDB and use it to create a pandas DatetimeIndex

idx =  pd.DatetimeIndex([x['Date'] for x in test_docs], freq='D')

which gives me

enter image description here

which is incorrect since the time has not been converted back correctly from UTC to local time. So I followed the solution given in this answer

idx =  pd.DatetimeIndex([x['Date'] for x in test_docs], freq='D')
idx = idx.tz_localize(tz=tz.tzutc())
idx = idx.tz_convert(tz=tz.tzlocal())
frame = DataFrame(test_docs, index=idx)
frame = frame.drop('Date', 1)

which gives me the right day back

enter image description here

I then normalize the DatetimeIndex so the hours are removed, allowing me to group all entries by day.

frame.groupby(idx).sum()

At this point, however, something strange happens. The dates end up getting grouped as follows

enter image description here

but this doesn't reflect the dates in the frame

enter image description here

Can anyone shed some light on where I might be going wrong?


Response to @ptrj

Explicitly using my timezone as a string

idx =  pd.DatetimeIndex([x['Date'] for x in test_docs], freq='D')
idx = idx.tz_localize(tz=tz.tzutc())
idx = idx.tz_convert(tz='Europe/Dublin')
idx = idx.normalize()
frame = DataFrame(test_docs, index=idx)
...
...
aggregate = frame.groupby(idx).sum()
aggregate.plot()

this doesn't work for me, it results in the following plot

enter image description here

For some reason the groupby is not properly grouping for 2014, as shown below

enter image description here

If instead, I use

idx = idx.tz_convert(tz.gettz('Europe/Dublin'))

I get the same problem

Convert to an object

idx =  pd.DatetimeIndex([x['Date'] for x in test_docs], freq='D')
idx = idx.tz_localize(tz=tz.tzutc())
idx = idx.tz_convert(tz=tz.tzlocal())
idx = idx.normalize()
frame = DataFrame(test_docs, index=idx)
aggregate = frame.groupby(idx.astype(object)).sum()

This approach seems to work correctly for me

enter image description here

Upvotes: 7

Views: 1767

Answers (2)

ptrj
ptrj

Reputation: 5222

I was able to reproduce the error with the following data:

idx0 = pd.date_range('2011-11-11', periods=4)
idx1 = idx0.tz_localize(tz.tzutc())
idx2 = idx1.tz_convert(tz.tzlocal())
df = pd.DataFrame([1, 2, 3, 4])

df.groupby(idx2).sum()
Out[20]: 
                           0
1970-01-01 00:00:00-05:00  9
2011-11-10 19:00:00-05:00  1

It's a bug deep in the pandas code, related exclusively to tz.tzlocal(). It manifests itself also in:

idx2.tz_localize(None)
Out[27]: 
DatetimeIndex(['2011-11-10 19:00:00', '1970-01-01 00:00:00',
               '1970-01-01 00:00:00', '1970-01-01 00:00:00'],
              dtype='datetime64[ns]', freq='D')

You can use any of the following solutions:

  • use explicitly your timezone as a string:

    idx2 = idx1.tz_convert(tz='Europe/Dublin')
    df.groupby(idx2).sum()
    Out[29]: 
                               0
    2011-11-11 00:00:00+00:00  1
    2011-11-12 00:00:00+00:00  2
    2011-11-13 00:00:00+00:00  3
    2011-11-14 00:00:00+00:00  4
    

    or if it doesn't work:

    idx2 = idx1.tz_convert(tz.gettz('Europe/Dublin'))
    
  • convert it to an object:

    df.groupby(idx2.astype(object)).sum()
    Out[32]: 
                               0
    2011-11-10 19:00:00-05:00  1
    2011-11-11 19:00:00-05:00  2
    2011-11-12 19:00:00-05:00  3
    2011-11-13 19:00:00-05:00  4
    

Basically, converting to anything else than DatetimeIndex with tz=tz.local() should work.


EDIT: This bug has been just fixed on pandas github. The fix will be available in pandas 0.19 release.

Upvotes: 3

Philip O'Brien
Philip O'Brien

Reputation: 4266

I have managed to get around this for now by changing my groupby to the following

frame.groupby([pd.DatetimeIndex([x.date() for x in frame.index])]).sum()

so where I was originally trying to groupby

idx =  pd.DatetimeIndex([x['Date'] for x in test_docs], freq='D')
idx = idx.tz_localize(tz=tz.tzutc())
idx = idx.tz_convert(tz=tz.tzlocal())
frame.groupby(idx).sum()

I am now calling the date method on each element of the index prior to performing the groupby operation.

I'm posting this as an answer in case nobody replies, but I am hoping for someone to answer and explain what is happening, as my 'solution' seems too hacky for my tastes.

Upvotes: 0

Related Questions