cataclysmic

Reputation: 337

Python (pandas) fast mapping of multiple datetimes to their series indices?

I have a large Pandas dataframe in which one column contains (unordered) datetimes from a known period (the year 2013). I need an efficient way to convert these datetimes to indices, where each index = the number of hours since start_time ('2013-01-01 00:00'). There are duplicate times, which should map to duplicate indices.

Obviously, this can be done one at a time in a loop using timedelta. It can also be done in a loop using a Pandas Series (see the following snippet, which builds the ordered series of all hourly datetimes since start_time):

import pandas as pd

nhours = 365*24
time_series = pd.Series(range(nhours), index=pd.date_range('2013-01-01', periods=nhours, freq='H'))

After running this snippet, one can get indices using the .index or .get_loc methods in a loop.
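For concreteness, a minimal sketch of that loop using get_loc (the sample datetimes here are hypothetical and match the example further down):

import pandas as pd

nhours = 365*24
time_series = pd.Series(range(nhours),
                        index=pd.date_range('2013-01-01', periods=nhours, freq='H'))

# Hypothetical unordered datetimes to convert
times = pd.to_datetime(['2013-01-01 11:00:00', '2013-01-01 11:00:00',
                        '2013-01-01 00:00:00', '2013-12-30 18:00:00'])

# One lookup per element -- correct, but slow for a large column
indices = [time_series.index.get_loc(t) for t in times]
# indices == [11, 11, 0, 8730]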

**However, is there a fast (non-loopy?) way to take a column of arbitrary datetimes and find their respective indices?**

For example, inputting the following column of datetimes:

2013-01-01 11:00:00
2013-01-01 11:00:00
2013-01-01 00:00:00
2013-12-30 18:00:00

should output the following indices: [11, 11, 0, 8730]

Upvotes: 1

Views: 529

Answers (3)

Bob Baxley

Reputation: 3761

Use isin:

time_series[time_series.index.isin(['2013-01-01 11:00:00',
                                    '2013-01-01 00:00:00',
                                    '2013-12-30 18:00:00'])].values

# Returns: array([   0,   11, 8730])

between and between_time are also useful.
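A hedged sketch of those two selectors (not a direct answer to the indexing question, just illustrating what they do):

import pandas as pd

nhours = 365*24
time_series = pd.Series(range(nhours),
                        index=pd.date_range('2013-01-01', periods=nhours, freq='H'))

# between_time keeps only rows whose time of day falls in the given window
morning = time_series.between_time('09:00', '11:00')

# between filters on the values themselves (here, the hour-index values)
first_day = time_series[time_series.between(0, 23)]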

Upvotes: 0

cataclysmic

Reputation: 337

Thank you for the responses. I have a new, faster solution that takes advantage of pandas' support for datetime and timedelta types. It turns out to be roughly twice as fast as Colin's solution below (although not as flexible), and it avoids the overhead of building a Series of ordered datetimes:

all_indices = (df['mydatetimes'] - datetime(2013,1,1,0)) / np.timedelta64(1,'h') 

where df is the pandas dataframe and 'mydatetimes' is the name of the column containing the datetimes.
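A self-contained sketch of how this might look end to end (the dataframe below is a hypothetical stand-in for the real one; casting to int is optional):

import numpy as np
import pandas as pd
from datetime import datetime

# Hypothetical stand-in for the real dataframe
df = pd.DataFrame({'mydatetimes': pd.to_datetime(
    ['2013-01-01 11:00:00', '2013-01-01 11:00:00',
     '2013-01-01 00:00:00', '2013-12-30 18:00:00'])})

# Elapsed time since the start of 2013, expressed in hours
all_indices = (df['mydatetimes'] - datetime(2013, 1, 1, 0)) / np.timedelta64(1, 'h')
print(all_indices.astype(int).values)   # [  11   11    0 8730]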

Timing the code shows that this solution converts 30,000 datetimes in:

0:00:00.009909 --> this snippet

0:00:00.017800 --> Colin's solution with ts=Series(...) and ts.loc (I have excluded the one-time overhead of building the Series from this timing)
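The exact benchmark code wasn't shown; a rough sketch of how such a comparison might be run (random sample times, names assumed) is:

from datetime import datetime
import numpy as np
import pandas as pd

n = 30000
rng = pd.date_range('2013-01-01', periods=365*24, freq='H')
sample = pd.DataFrame({'mydatetimes': np.random.choice(rng, n)})

# Vectorized timedelta division (the snippet above)
t0 = datetime.now()
fast = (sample['mydatetimes'] - datetime(2013, 1, 1, 0)) / np.timedelta64(1, 'h')
print(datetime.now() - t0)

# Lookup via .loc on a prebuilt Series (Colin's approach); the Series
# construction is excluded from the timed region, as in the post
ts = pd.Series(range(len(rng)), index=rng)
t1 = datetime.now()
looked_up = ts.loc[sample['mydatetimes']].values
print(datetime.now() - t1)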

Upvotes: 0

Colin

Reputation: 2137

loc can take a list or array of labels to look up:

>>> print time_series.loc[[pd.Timestamp('20130101 11:00'), pd.Timestamp('20130101 11:00'), pd.Timestamp('20130101'), pd.Timestamp('20131230 18:00')]]
2013-01-01 11:00:00      11
2013-01-01 11:00:00      11
2013-01-01 00:00:00       0
2013-12-30 18:00:00    8730
dtype: int64
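Applied to a whole dataframe column in one call (reusing the df / 'mydatetimes' names from the other answer as a hypothetical example), the same lookup might look like:

import pandas as pd

nhours = 365*24
time_series = pd.Series(range(nhours),
                        index=pd.date_range('2013-01-01', periods=nhours, freq='H'))

df = pd.DataFrame({'mydatetimes': pd.to_datetime(
    ['2013-01-01 11:00:00', '2013-01-01 11:00:00',
     '2013-01-01 00:00:00', '2013-12-30 18:00:00'])})

indices = time_series.loc[df['mydatetimes']].values
# array([  11,   11,    0, 8730])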

Upvotes: 2
