Mark Horvath

Reputation: 1166

Fast selection of a Timestamp range in hierarchically indexed pandas data in Python

Given a DataFrame with a tz-aware DatetimeIndex, the following is a fast way to select multiple rows between two dates as a left-inclusive, right-exclusive interval:

import pandas as pd
start_ts = pd.Timestamp('20000101 12:00 UTC')
end_ts = pd.Timestamp('20000102 12:00 UTC')
ix_df = pd.DataFrame(0, index=[pd.Timestamp('20000101 00:00 UTC'), pd.Timestamp('20000102 00:00 UTC')], columns=['a'])
EPSILON_TIME = pd.tseries.offsets.Nano()
ix_df[start_ts:end_ts-EPSILON_TIME]

The above solution is fairly efficient, as we neither create a temporary indexing iterable (as I'll do later) nor run a lambda expression in Python to create the new data frame. In fact I believe the selection is at most around O(log(N)). I wonder whether this is also possible on a particular level of a MultiIndex, or whether I have to create a temporary iterable or run a lambda expression. E.g.:

mux = pd.MultiIndex.from_arrays([[pd.Timestamp('20000102 00:00 UTC'), pd.Timestamp('20000103 00:00 UTC')], [pd.Timestamp('20000101 00:00 UTC'), pd.Timestamp('20000102 00:00 UTC')]])
mux_df = pd.DataFrame(0, index=mux, columns=['a'])

Then I can select on the first (zeroth) level of the index in the same way:

mux_df[start_ts:end_ts-EPSILON_TIME]

which yields:

                                                     a
2000-01-02 00:00:00+00:00 2000-01-01 00:00:00+00:00  0

but for the second level I have to choose a slow solution:

values_itr = mux_df.index.get_level_values(1)
mask_ser = (values_itr >= start_ts) & (values_itr < end_ts)
mux_df[mask_ser]

which correctly yields:

                                                     a
2000-01-03 00:00:00+00:00 2000-01-02 00:00:00+00:00  0

Any fast workarounds? Thanks!

Edit: Chosen Approach

I ended up with this solution after all, having realized I also need slicing:

import numpy as np

def view(data_df):
    # start_ts and end_ts are globals here; the first index level must be
    # sorted for searchsorted (a binary search) to be valid
    if len(data_df.index) == 0:
        return data_df
    values_itr = data_df.index.get_level_values(0).values
    from_i = np.searchsorted(values_itr, np.datetime64(start_ts), side='left')
    to_i = np.searchsorted(values_itr, np.datetime64(end_ts), side='left')
    # from_i and to_i are positions, so use .iloc (.ix is deprecated)
    return data_df.iloc[from_i:to_i]

Then call view(data_df).copy(). Note: the values in the first level of my index are in fact sorted.
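As a self-contained sketch of this approach on the question's data (assuming modern pandas, so `.iloc` replaces the removed `.ix`, and using `Index.searchsorted` instead of raw numpy; the function signature taking the bounds as parameters is my variation):

```python
import pandas as pd

def view(data_df, start_ts, end_ts):
    # Left-inclusive, right-exclusive slice on the (sorted) first index
    # level, found by binary search -- O(log N) instead of a full scan.
    if len(data_df.index) == 0:
        return data_df
    level0 = data_df.index.get_level_values(0)
    from_i = level0.searchsorted(start_ts, side='left')
    to_i = level0.searchsorted(end_ts, side='left')
    return data_df.iloc[from_i:to_i]

mux = pd.MultiIndex.from_arrays([
    [pd.Timestamp('20000102 00:00 UTC'), pd.Timestamp('20000103 00:00 UTC')],
    [pd.Timestamp('20000101 00:00 UTC'), pd.Timestamp('20000102 00:00 UTC')],
])
mux_df = pd.DataFrame(0, index=mux, columns=['a'])
result = view(mux_df,
              pd.Timestamp('20000101 12:00 UTC'),
              pd.Timestamp('20000102 12:00 UTC'))
print(result)
```

This returns the single row whose level-0 timestamp falls in the half-open interval.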

Upvotes: 1

Views: 1239

Answers (1)

Jeff

Reputation: 128948

Well you are actually comparing apples to oranges here.

In [59]: N = 1000000

In [60]: pd.set_option('max_rows',10)

In [61]: idx = pd.IndexSlice

In [62]: df = DataFrame(np.arange(N).reshape(-1,1),columns=['value'],index=pd.MultiIndex.from_product([list('abcdefghij'),date_range('20010101',periods=N/10,freq='T',tz='US/Eastern')],names=['one','two']))

In [63]: df
Out[63]: 
                                value
one two                              
a   2001-01-01 00:00:00-05:00       0
    2001-01-01 00:01:00-05:00       1
    2001-01-01 00:02:00-05:00       2
    2001-01-01 00:03:00-05:00       3
    2001-01-01 00:04:00-05:00       4
...                               ...
j   2001-03-11 10:35:00-05:00  999995
    2001-03-11 10:36:00-05:00  999996
    2001-03-11 10:37:00-05:00  999997
    2001-03-11 10:38:00-05:00  999998
    2001-03-11 10:39:00-05:00  999999

[1000000 rows x 1 columns]

In [64]: df2 = df.reset_index(level='one').sort_index()
In [65]: df2
Out[65]: 
                          one   value
two                                  
2001-01-01 00:00:00-05:00   a       0
2001-01-01 00:00:00-05:00   i  800000
2001-01-01 00:00:00-05:00   h  700000
2001-01-01 00:00:00-05:00   g  600000
2001-01-01 00:00:00-05:00   f  500000
...                        ..     ...
2001-03-11 10:39:00-05:00   c  299999
2001-03-11 10:39:00-05:00   b  199999
2001-03-11 10:39:00-05:00   a   99999
2001-03-11 10:39:00-05:00   i  899999
2001-03-11 10:39:00-05:00   j  999999

[1000000 rows x 2 columns]

When I reset the index (iow create a single-level index), it IS NO LONGER UNIQUE. This makes a big difference, because the search works differently. So you cannot really compare indexing on a non-unique single-level index with indexing on a multi-level one.
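You can verify this directly: pandas exposes uniqueness and monotonicity as index properties, and the fast slicing path relies on them. A minimal sketch with a scaled-down version of the frames above (N, the two letters, and `freq='min'` are illustrative choices, not the answer's exact setup):

```python
import numpy as np
import pandas as pd

N = 1000
df = pd.DataFrame(
    np.arange(N).reshape(-1, 1), columns=['value'],
    index=pd.MultiIndex.from_product(
        [list('ab'),
         pd.date_range('20010101', periods=N // 2, freq='min',
                       tz='US/Eastern')],
        names=['one', 'two']))

# Resetting 'one' makes each timestamp repeat once per letter, so the
# resulting single-level index is sorted but no longer unique.
df2 = df.reset_index(level='one').sort_index()
print(df.index.is_unique, df2.index.is_unique,
      df2.index.is_monotonic_increasing)
```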

It turns out that using the multi-index slicers (introduced in 0.14.0) makes indexing pretty fast on any level.

In [66]: %timeit df.loc[idx[:,'20010201':'20010301'],:]
1 loops, best of 3: 188 ms per loop

In [67]: df.loc[idx[:,'20010201':'20010301'],:]
Out[67]: 
                                value
one two                              
a   2001-02-01 00:00:00-05:00   44640
    2001-02-01 00:01:00-05:00   44641
    2001-02-01 00:02:00-05:00   44642
    2001-02-01 00:03:00-05:00   44643
    2001-02-01 00:04:00-05:00   44644
...                               ...
j   2001-03-01 23:55:00-05:00  986395
    2001-03-01 23:56:00-05:00  986396
    2001-03-01 23:57:00-05:00  986397
    2001-03-01 23:58:00-05:00  986398
    2001-03-01 23:59:00-05:00  986399

[417600 rows x 1 columns]

Compare this with a non-unique single-level index:

In [68]: %timeit df2.loc['20010201':'20010301']
1 loops, best of 3: 470 ms per loop

Here is a UNIQUE single-level index:

In [73]: df3 = DataFrame(np.arange(N).reshape(-1,1),columns=['value'],index=date_range('20010101',periods=N,freq='T',tz='US/Eastern'))

In [74]: df3
Out[74]: 
                            value
2001-01-01 00:00:00-05:00       0
2001-01-01 00:01:00-05:00       1
2001-01-01 00:02:00-05:00       2
2001-01-01 00:03:00-05:00       3
2001-01-01 00:04:00-05:00       4
...                           ...
2002-11-26 10:35:00-05:00  999995
2002-11-26 10:36:00-05:00  999996
2002-11-26 10:37:00-05:00  999997
2002-11-26 10:38:00-05:00  999998
2002-11-26 10:39:00-05:00  999999

[1000000 rows x 1 columns]

In [75]: df3.loc['20010201':'20010301']
Out[75]: 
                           value
2001-02-01 00:00:00-05:00  44640
2001-02-01 00:01:00-05:00  44641
2001-02-01 00:02:00-05:00  44642
2001-02-01 00:03:00-05:00  44643
2001-02-01 00:04:00-05:00  44644
...                          ...
2001-03-01 23:55:00-05:00  86395
2001-03-01 23:56:00-05:00  86396
2001-03-01 23:57:00-05:00  86397
2001-03-01 23:58:00-05:00  86398
2001-03-01 23:59:00-05:00  86399

[41760 rows x 1 columns]

Fastest so far

In [76]: %timeit df3.loc['20010201':'20010301']
1 loops, best of 3: 294 ms per loop

Best is a single-level UNIQUE index without a timezone:

In [77]: df3 = DataFrame(np.arange(N).reshape(-1,1),columns=['value'],index=date_range('20010101',periods=N,freq='T'))

In [78]: %timeit df3.loc['20010201':'20010301']
1 loops, best of 3: 240 ms per loop

And by far the fastest method (I am doing a slightly different search here to get the same results, as the semantics of the above slices include all times on the endpoint dates):

In [101]: df4 = df3.reset_index()

In [103]: %timeit df4.loc[(df4['index']>='20010201') & (df4['index']<'20010302')]
100 loops, best of 3: 10.6 ms per loop

In [104]:  df4.loc[(df4['index']>='20010201') & (df4['index']<'20010302')]
Out[104]: 
                    index  value
44640 2001-02-01 00:00:00  44640
44641 2001-02-01 00:01:00  44641
44642 2001-02-01 00:02:00  44642
44643 2001-02-01 00:03:00  44643
44644 2001-02-01 00:04:00  44644
...                   ...    ...
86395 2001-03-01 23:55:00  86395
86396 2001-03-01 23:56:00  86396
86397 2001-03-01 23:57:00  86397
86398 2001-03-01 23:58:00  86398
86399 2001-03-01 23:59:00  86399

[41760 rows x 2 columns]

Ok, so why is the 4th method the fastest? It constructs a boolean indexing array and then uses nonzero, so it is pretty fast. The first three methods use searchsorted (twice), after having already determined that the index is unique and monotonic, to figure out the endpoints, so there are multiple things going on.

Bottom line: boolean indexing is pretty fast, so use it! (Results may differ and the first 3 methods may become faster depending on WHAT you are selecting, e.g. a smaller selection range may have different performance characteristics.)
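The boolean-mask pattern is easy to reproduce in isolation. A small stand-in for `df4` above (100 daily rows rather than a million minutes, so the numbers differ from the timings shown):

```python
import numpy as np
import pandas as pd

# Timestamps live in an ordinary column, not in the index.
df4 = pd.DataFrame({
    'index': pd.date_range('20010101', periods=100, freq='D'),
    'value': np.arange(100),
})

# Two vectorized comparisons build a boolean mask; .loc then takes the
# matching rows in a single pass -- no binary search on an index involved.
mask = (df4['index'] >= '20010201') & (df4['index'] < '20010302')
out = df4.loc[mask]
print(len(out))
```

Note the right bound of `'20010302'` with a strict `<`, which reproduces the "all times on 2001-03-01 included" semantics of the label slices above.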

Upvotes: 3
