nicholas.reichel

Reputation: 2270

Sort pandas DataFrame by date, return unique dates, and convert the top two to epoch time

I have a pandas dataframe as follows:

        Date Symbol
0 2015-02-20      A
1 2016-01-15      A
2 2016-01-15      A
3 2015-08-21      A

I need the output to be a sorted list of the unique dates in epoch time, like the following: [1424390400, 1440115200, 1452816000]

But when I return the unique dates from the DataFrame, I get: ['2015-02-19T18:00:00.000000000-0600' '2016-01-14T18:00:00.000000000-0600' '2015-08-20T19:00:00.000000000-0500']

Those hours are not correct; the dates end up being a whole day off.

How do I sort by date, return unique, and then convert to epoch with the hour of 00:00:00? Thanks in advance.

Here's a demo script of what I have so far:

import pandas as pd

df = pd.DataFrame({'Symbol': ['A', 'A', 'A', 'A'],
                   'Date': ['02/20/2015', '01/15/2016', '01/15/2016', '08/21/2015']})

df['Date'] = pd.to_datetime(df.Date)
dates = df['Date'].unique()
print dates

Upvotes: 1

Views: 2348

Answers (3)

Jeff

Reputation: 128948

This is not a bug, but rather a display 'issue' with numpy. Numpy chooses to display the datetimes in the local time zone (for historical reasons, I think).

In [21]: df['Date'].unique()
Out[21]: 
array(['2015-02-19T19:00:00.000000000-0500',
       '2016-01-14T19:00:00.000000000-0500',
       '2015-08-20T20:00:00.000000000-0400'], dtype='datetime64[ns]')

Convert back to a DatetimeIndex (pandas returns unique as a numpy array for compatibility reasons).

In [22]: pd.DatetimeIndex(df['Date'].unique())
Out[22]: 
<class 'pandas.tseries.index.DatetimeIndex'>
[2015-02-20, ..., 2015-08-21]
Length: 3, Freq: None, Timezone: None

A little math gets you epoch seconds. The underlying data is stored in ns precision.

In [23]: pd.DatetimeIndex(df['Date'].unique()).asi8/10**9
Out[23]: array([1424390400, 1452816000, 1440115200])

Note that this is all vectorized so will be quite fast.
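For completeness, here is a minimal end-to-end sketch of this approach on the question's sample data. The sort_values() call is my own addition (assuming a pandas version that provides it on DatetimeIndex) so the epochs come out in the sorted order the question asks for:

import pandas as pd

df = pd.DataFrame({'Symbol': ['A', 'A', 'A', 'A'],
                   'Date': ['02/20/2015', '01/15/2016', '01/15/2016', '08/21/2015']})
df['Date'] = pd.to_datetime(df.Date)

# unique() hands back a plain array, so wrap it in a DatetimeIndex again,
# sort it, and integer-divide the ns-precision values down to seconds
idx = pd.DatetimeIndex(df['Date'].unique()).sort_values()
epochs = (idx.asi8 // 10**9).tolist()
print(epochs)  # [1424390400, 1440115200, 1452816000]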

Upvotes: 3

andrew

Reputation: 4089

After some experimentation, it appears that the hours get changed when you call df['Date'].unique(). The pandas unique() method returns a numpy ndarray, so the call actually casts your column elements from pandas Timestamps to numpy datetime64 values, and that conversion is what introduces the odd time zones and shifted hours.
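A quick way to see the type change for yourself (just an illustrative check; the exact return type of unique() can vary across pandas versions):

# assuming df['Date'] has already been parsed with pd.to_datetime
print(type(df['Date'].iloc[0]))      # pandas Timestamp
print(type(df['Date'].unique()[0]))  # numpy datetime64 after unique()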

To get a list of unique epoch time integers, try:

dates = df['Date'].map(lambda t: t.value / (10**9)).unique()

Since the elements of df['Date'] are Timestamps, we can get the nanosecond value from t.value and then divide by 10^9 to convert to epoch seconds.
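As a usage sketch on the question's sample data (I use integer division here so the results stay whole seconds), sorting the result gives exactly the list the question asks for:

dates = df['Date'].map(lambda t: t.value // 10**9).unique()
print(sorted(dates))  # [1424390400, 1440115200, 1452816000]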

Upvotes: 1

nicholas.reichel

Reputation: 2270

Here's the only hack I could come up with.

import pandas as pd
import numpy as np
df = pd.DataFrame({'Symbol': ['A', 'A', 'A', 'A'],
                   'Date': ['02/20/2015', '01/15/2016', '01/15/2016', '08/21/2015']})

df['Date'] = pd.to_datetime(df.Date)
df = df.sort('Date')
# collect the first two unique dates as 'YYYY-MM-DD' strings
dates = []
seen = []
for i in df['Date']:
    if i in seen:
        continue
    else:
        seen.append(i)
        dates.append(str(i).split(' ')[0])
    if len(dates) == 2:
        break
print dates

And then it's converted to epoch time using:

from datetime import datetime

for i in dates:
    cur_dates = i.split('-')
    # subtract the Unix epoch to get whole seconds for midnight of each date
    epoch = int((datetime(int(cur_dates[0]), int(cur_dates[1]), int(cur_dates[2]), 0, 0) - datetime(1970, 1, 1)).total_seconds())
    epoch = str(epoch)
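If the epoch values need to be kept rather than overwritten on each pass, they could be collected into a list; a small sketch of that idea (the epochs name is my own):

from datetime import datetime

epochs = []
for i in dates:
    year, month, day = (int(x) for x in i.split('-'))
    # seconds between midnight of this date and the Unix epoch
    epochs.append(int((datetime(year, month, day) - datetime(1970, 1, 1)).total_seconds()))
print(epochs)  # [1424390400, 1440115200] for the two earliest dates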

Upvotes: 0
