Reputation: 17017
I am reading multiple spreadsheets of time series into a pandas DataFrame and concatenating them on a common pandas datetime index. The datalogger that recorded the time series is not 100% accurate, which makes resampling very annoying: depending on whether a timestamp falls slightly above or below the interval being sampled, resampling creates NaNs and my series starts to look like a broken line. Here's my code:
import time
import pandas as pd

def loaddata(filepaths):
    t1 = time.perf_counter()  # time.clock() was removed in Python 3.8
    for i in range(len(filepaths)):
        xl = pd.ExcelFile(filepaths[i])
        df = xl.parse(xl.sheet_names[0], header=0, index_col=2, skiprows=[0, 2, 3, 4], parse_dates=True)
        df = df.dropna(axis=1, how='all')
        df = df.drop(['Decimal Year Day', 'Decimal Year Day.1', 'RECORD'], axis=1)
        if i == 0:
            dfs = df
        else:
            # join the files side by side on the shared datetime index
            dfs = pd.concat([dfs, df], axis=1)
    t2 = time.perf_counter()
    print("Files loaded into dataframe in %s seconds" % (t2 - t1))
    return dfs  # without this, data below would be None

files = ["London Lysimeters corrected 5min.xlsx", "London Water Balance 5min.xlsx"]
data = loaddata(files)
Here's an idea of the index:
data.index
<class 'pandas.tseries.index.DatetimeIndex'> [2012-08-27 12:05:00.000002, ..., 2013-07-12 15:10:00.000004] Length: 91910, Freq: None, Timezone: None
What would be the fastest and most general way to round the index to the nearest minute?
Upvotes: 4
Views: 3816
Reputation: 5440
Issue 4314, mentioned by Jeff, is now closed, and a round() method was added for DatetimeIndex, Timestamp, TimedeltaIndex and Timedelta in pandas 0.18.0. Now we can do the following:
In[109]: index = pd.DatetimeIndex([pd.Timestamp('20120827 12:05:00.002'), pd.Timestamp('20130101 12:05:01'), pd.Timestamp('20130712 15:10:30'), pd.Timestamp('20130712 15:10:31')])
In[110]: index.values
Out[110]:
array(['2012-08-27T12:05:00.002000000', '2013-01-01T12:05:01.000000000',
'2013-07-12T15:10:30.000000000', '2013-07-12T15:10:31.000000000'], dtype='datetime64[ns]')
In[111]: index.round('min')
Out[111]:
DatetimeIndex(['2012-08-27 12:05:00', '2013-01-01 12:05:00',
'2013-07-12 15:10:00', '2013-07-12 15:11:00'],
dtype='datetime64[ns]', freq=None)
round() accepts a frequency parameter. String aliases for it are listed in the pandas documentation under offset aliases.
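Applied to the situation in the question, this could look like the sketch below. The frame, its `level` column and the three timestamps are made up for illustration; only `DatetimeIndex.round` comes from the answer above, and it needs pandas 0.18.0 or later.

```python
import pandas as pd

# Hypothetical logger readings: timestamps drift slightly off the 5-minute grid.
index = pd.DatetimeIndex([
    "2012-08-27 12:05:00.000002",
    "2012-08-27 12:09:59.999998",
    "2012-08-27 12:15:00.000004",
])
data = pd.DataFrame({"level": [1.0, 2.0, 3.0]}, index=index)

# Snap every timestamp to the nearest minute before resampling or concatenating.
data.index = data.index.round("min")

print(data.index.tolist())
```

After rounding, rows from different files that were logged a few microseconds apart should share the same index label, so concatenating along axis=1 lines them up instead of producing NaN-filled rows.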
Upvotes: 4
Reputation: 2485
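For data columns, use round_hour(df.Start_time), passing tt='M' or tt='H' to round to the minute or the hour (the default rounds to the second):
import numpy as np
import pandas as pd

def round_hour(x, tt=''):
    # x is assumed to be a datetime64[ns] Series; astype('i8') gives nanoseconds since the epoch
    if tt == 'M':
        # nearest minute
        return pd.to_datetime(((x.astype('i8') / (1e9 * 60)).round() * 1e9 * 60).astype(np.int64))
    elif tt == 'H':
        # nearest hour
        return pd.to_datetime(((x.astype('i8') / (1e9 * 60 * 60)).round() * 1e9 * 60 * 60).astype(np.int64))
    else:
        # nearest second
        return pd.to_datetime(((x.astype('i8') / 1e9).round() * 1e9).astype(np.int64))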
Upvotes: 0
Reputation: 129048
Here's a little trick. Datetimes are in nanoseconds (when viewed as np.int64), so round to minutes in nanoseconds.
In [75]: index = pd.DatetimeIndex([ pd.Timestamp('20120827 12:05:00.002'), pd.Timestamp('20130101 12:05:01'), pd.Timestamp('20130712 15:10:00'), pd.Timestamp('20130712 15:10:00.000004') ])
In [79]: index.values
Out[79]:
array(['2012-08-27T08:05:00.002000000-0400',
'2013-01-01T07:05:01.000000000-0500',
'2013-07-12T11:10:00.000000000-0400',
'2013-07-12T11:10:00.000004000-0400'], dtype='datetime64[ns]')
In [78]: pd.DatetimeIndex(((index.asi8/(1e9*60)).round()*1e9*60).astype(np.int64)).values
Out[78]:
array(['2012-08-27T08:05:00.000000000-0400',
'2013-01-01T07:05:00.000000000-0500',
'2013-07-12T11:10:00.000000000-0400',
'2013-07-12T11:10:00.000000000-0400'], dtype='datetime64[ns]')
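The same trick can be written with integer arithmetic only, which sidesteps any floating-point concerns. The sketch below is a variant, not the code from this answer: `NS_PER_MINUTE` is a name introduced here, the last timestamp is added for illustration, and exact half-minute ties round up, whereas the float `.round()` above rounds ties to the nearest even value.

```python
import pandas as pd

NS_PER_MINUTE = 60 * 10**9  # nanoseconds in one minute

index = pd.DatetimeIndex([
    pd.Timestamp("20120827 12:05:00.002"),
    pd.Timestamp("20130101 12:05:01"),
    pd.Timestamp("20130712 15:10:00.000004"),
    pd.Timestamp("20130712 15:10:31"),
])

# Force nanosecond resolution so the int64 view is nanoseconds since the epoch.
ns = index.astype("datetime64[ns]").asi8

# Add half a minute, then floor to a whole number of minutes.
rounded_ns = (ns + NS_PER_MINUTE // 2) // NS_PER_MINUTE * NS_PER_MINUTE

rounded = pd.DatetimeIndex(rounded_ns)
print(rounded)
```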
Upvotes: 6