Reputation: 2270
I have a pandas dataframe as follows:
Date Symbol
0 2015-02-20 A
1 2016-01-15 A
2 2016-01-15 A
3 2015-08-21 A
I need the output to be a list of the unique dates in epoch time, sorted. Like the following:
[1424390400, 1440115200, 1452816000]
But when I return the unique dates from the Dataframe I get:
['2015-02-19T18:00:00.000000000-0600' '2016-01-14T18:00:00.000000000-0600'
'2015-08-20T19:00:00.000000000-0500']
These show incorrect hours; the dates end up a whole day off.
How do I sort by date, return unique, and then convert to epoch with the hour of 00:00:00? Thanks in advance.
Here's a demo script of what I have so far:
import pandas as pd
df = pd.DataFrame({'Symbol': ['A', 'A', 'A', 'A'],
                   'Date': ['02/20/2015', '01/15/2016', '01/15/2016', '08/21/2015']})
df['Date'] = pd.to_datetime(df.Date)
dates = df['Date'].unique()
print(dates)
Upvotes: 1
Views: 2348
Reputation: 128948
This is not a bug, but rather a display 'issue' with numpy: it chooses to display the datetimes in the local time zone (for historical reasons, I think).
In [21]: df['Date'].unique()
Out[21]:
array(['2015-02-19T19:00:00.000000000-0500',
'2016-01-14T19:00:00.000000000-0500',
'2015-08-20T20:00:00.000000000-0400'], dtype='datetime64[ns]')
Convert back to a DatetimeIndex (pandas returns unique as a numpy-array for compat reasons).
In [22]: pd.DatetimeIndex(df['Date'].unique())
Out[22]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2015-02-20, ..., 2015-08-21]
Length: 3, Freq: None, Timezone: None
A little math gets you epoch seconds; the underlying data is stored with ns precision.
In [23]: pd.DatetimeIndex(df['Date'].unique()).asi8/10**9
Out[23]: array([1424390400, 1452816000, 1440115200])
Note that this is all vectorized so will be quite fast.
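Putting the pieces together, here is a minimal end-to-end sketch (the sort is added because the question asks for a sorted list, and the int conversion goes through numpy's `datetime64[s]` rather than `asi8`, which newer pandas versions no longer expose):

```python
import pandas as pd

df = pd.DataFrame({'Symbol': ['A', 'A', 'A', 'A'],
                   'Date': ['02/20/2015', '01/15/2016', '01/15/2016', '08/21/2015']})
df['Date'] = pd.to_datetime(df.Date)

# unique() returns a numpy array; wrap it back in a DatetimeIndex,
# sort it, then convert ns-precision datetimes to whole epoch seconds
idx = pd.DatetimeIndex(df['Date'].unique()).sort_values()
epochs = idx.values.astype('datetime64[s]').astype('int64').tolist()
print(epochs)  # [1424390400, 1440115200, 1452816000]
```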
Upvotes: 3
Reputation: 4089
After some experimentation, it appears that the hours get changed when you call df['Date'].unique(). The pandas unique() method returns a NumPy ndarray, so it casts your column elements from pandas Timestamps to NumPy datetime64 values, and that cast is what introduces the weird time zones and hour shifts.
To get a list of unique epoch time integers, try:
dates = df['Date'].map(lambda t: t.value // 10**9).unique()
Since the elements of df['Date'] are Timestamps, we can get the nanosecond value by calling t.value and then divide by 10^9 to convert to epoch seconds.
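For example, applied to the question's frame (with a sort added, integer division so the result stays a whole number in Python 3, and `.tolist()` to get plain Python ints):

```python
import pandas as pd

df = pd.DataFrame({'Symbol': ['A', 'A', 'A', 'A'],
                   'Date': ['02/20/2015', '01/15/2016', '01/15/2016', '08/21/2015']})
df['Date'] = pd.to_datetime(df.Date)

# t.value is the Timestamp's nanosecond count since the epoch;
# integer-divide by 10**9 to get whole epoch seconds
dates = sorted(df['Date'].map(lambda t: t.value // 10**9).unique().tolist())
print(dates)  # [1424390400, 1440115200, 1452816000]
```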
Upvotes: 1
Reputation: 2270
Here's the only hack I could come up with.
import pandas as pd
import numpy as np
df = pd.DataFrame({'Symbol': ['A', 'A', 'A', 'A'],
                   'Date': ['02/20/2015', '01/15/2016', '01/15/2016', '08/21/2015']})
df['Date'] = pd.to_datetime(df.Date)
df = df.sort_values('Date')
dates = []
seen = []
for i in df['Date']:
    if i in seen:
        continue
    else:
        seen.append(i)
        dates.append(str(i).split(' ')[0])
    if len(dates) == 2:
        break
print(dates)
And then it's converted to epoch seconds using:
from datetime import datetime

for i in dates:
    cur_dates = i.split('-')
    epoch = int((datetime(int(cur_dates[0]), int(cur_dates[1]), int(cur_dates[2]), 0, 0) - datetime(1970, 1, 1)).total_seconds())
    epoch = str(epoch)
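The same manual epoch arithmetic, shown as a self-contained snippet (the input list of date strings is hard-coded here for illustration; subtracting `datetime(1970, 1, 1)` from a naive midnight datetime yields seconds since the Unix epoch):

```python
from datetime import datetime

dates = ['2015-02-20', '2015-08-21', '2016-01-15']
epochs = []
for d in dates:
    y, m, day = (int(part) for part in d.split('-'))
    # seconds between midnight of this date and the Unix epoch
    epochs.append(int((datetime(y, m, day) - datetime(1970, 1, 1)).total_seconds()))
print(epochs)  # [1424390400, 1440115200, 1452816000]
```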
Upvotes: 0