Reputation: 337
I have a large Pandas dataframe in which one column is (unordered) datetimes from a known period (the year 2013). I need an efficient way to convert these datetimes to indices, where each index = # hours since start_time ('2013-1-1 00'). There are duplicate times, which should map to duplicate indices.
Obviously, this can be done one-at-a-time with a loop by using timedelta. It can also be done with a loop by using Pandas Series (see the following snippet, which generates the ordered series of all datetimes since start_time):
import pandas as pd
nhours = 365*24
time_series = pd.Series(range(nhours), index=pd.date_range('2013-1-1', periods=nhours, freq='h'))
After running this snippet, one can get indices using the .index or .get_loc methods in a loop.
** However, is there a fast (non-loopy?) way to take a column of arbitrary datetimes and find their respective indices? **
For example, inputting the following column of datetimes:
2013-01-01 11:00:00
2013-01-01 11:00:00
2013-01-01 00:00:00
2013-12-30 18:00:00
should output the following indices: [11, 11, 0, 8730]
Upvotes: 1
Views: 529
Reputation: 3761
Use isin:
time_series[time_series.index.isin(['2013-01-01 11:00:00',
'2013-01-01 00:00:00',
'2013-12-30 18:00:00'])].values
# Returns: array([ 0, 11, 8730])
between and between_time are also useful.
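Note that the isin filter returns each matching hour once, in index order, so duplicates and the input order are lost. If those matter, as in the question, one possible alternative is Index.get_indexer, which returns one position per input label. A minimal sketch, assuming the same hourly index as in the question (the names hourly_index and wanted are made up for illustration):

```python
import pandas as pd

# Hourly index covering 2013 (365 days), mirroring the question's setup.
nhours = 365 * 24
hourly_index = pd.date_range(
    "2013-01-01", periods=nhours, freq=pd.Timedelta(hours=1)
)

# Hypothetical input, taken from the question's example column.
wanted = pd.to_datetime([
    "2013-01-01 11:00:00",
    "2013-01-01 11:00:00",
    "2013-01-01 00:00:00",
    "2013-12-30 18:00:00",
])

# get_indexer returns one integer position per input label, in input order;
# labels that are not present in the index come back as -1.
positions = hourly_index.get_indexer(wanted)
print(positions.tolist())  # [11, 11, 0, 8730]
```

Because the index starts at start_time and has one entry per hour, the position of a label is the same as the number of hours since start_time.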
Upvotes: 0
Reputation: 337
Thank you for the responses. I have a new, faster solution that takes advantage of the fact that pandas supports datetime and timedelta formats. It turns out that the following is roughly twice as fast as Colin's solution above (although not as flexible), and it avoids the overhead of building a Series of ordered datetimes:
all_indices = (df['mydatetimes'] - datetime(2013,1,1,0)) / np.timedelta64(1,'h')
where df is the pandas dataframe and 'mydatetimes' is the column name that includes the datetimes.
Timing both approaches on 30,000 datetimes gives:
0:00:00.009909 --> this snippet
0:00:00.017800 --> Colin's solution with ts=Series(...) and ts.loc. I have excluded the one-time overhead of building a Series from this timing
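For anyone who wants to try the subtraction approach on the example from the question, here is a small self-contained sketch. The dataframe contents are made up for illustration, and the floor-division step is my own addition to get whole-hour counts rather than floats:

```python
import numpy as np
import pandas as pd

# Hypothetical dataframe; 'mydatetimes' is the column name used above.
df = pd.DataFrame({
    "mydatetimes": pd.to_datetime([
        "2013-01-01 11:00:00",
        "2013-01-01 11:00:00",
        "2013-01-01 00:00:00",
        "2013-12-30 18:00:00",
    ])
})

start_time = pd.Timestamp("2013-01-01")

# Subtracting a Timestamp from a datetime column gives a timedelta column;
# dividing by one hour gives the elapsed hours as floats.
hours_float = (df["mydatetimes"] - start_time) / np.timedelta64(1, "h")

# Floor division by a one-hour Timedelta gives whole-hour counts instead
# (times that are not exactly on the hour are rounded down).
all_indices = (df["mydatetimes"] - start_time) // pd.Timedelta(hours=1)
print(all_indices.tolist())  # [11, 11, 0, 8730]
```

Both forms keep duplicates and input order, since they operate element-wise on the column.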
Upvotes: 0
Reputation: 2137
loc can take a list or array of labels to look up:
>>> print(time_series.loc[[pd.Timestamp('20130101 11:00'), pd.Timestamp('20130101 11:00'), pd.Timestamp('20130101'), pd.Timestamp('20131230 18:00')]])
2013-01-01 11:00:00 11
2013-01-01 11:00:00 11
2013-01-01 00:00:00 0
2013-12-30 18:00:00 8730
dtype: int64
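The same lookup should also work with a whole dataframe column instead of a hand-written list. A rough sketch, assuming the time_series from the question and a hypothetical dataframe df with a 'mydatetimes' column:

```python
import pandas as pd

# Ordered hourly series for 2013, as built in the question.
nhours = 365 * 24
time_series = pd.Series(
    range(nhours),
    index=pd.date_range("2013-01-01", periods=nhours, freq=pd.Timedelta(hours=1)),
)

# Hypothetical dataframe holding the datetimes to look up.
df = pd.DataFrame({
    "mydatetimes": pd.to_datetime([
        "2013-01-01 11:00:00",
        "2013-01-01 11:00:00",
        "2013-01-01 00:00:00",
        "2013-12-30 18:00:00",
    ])
})

# Pass the column's values to .loc as an array of labels; all lookups happen
# in one call, and duplicates and input order are preserved.
# A datetime that is missing from the index raises a KeyError.
indices = time_series.loc[df["mydatetimes"].to_numpy()].to_numpy()
print(indices.tolist())  # [11, 11, 0, 8730]
```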
Upvotes: 2