Tahnoon Pasha
Tahnoon Pasha

Reputation: 6018

Pandas multi-index best way to slice for ranges of subordinate levels

I have two questions concerning pandas dataframe multi-indices.

Assume I have a data-frame df as follows:

                                   data
    port                           bm  pf
    sector instrument date
    1      A          2013-01-14    0   0
                      2013-01-15    5   5
                      2013-01-16   10  10
                      2013-01-17   15  15
                      2013-01-18   20  20

Which can be generated with the following code:

import pandas as pd
date = pd.bdate_range('2013-01-14','2013-01-20').repeat(5)
sector = [1,1,1,2,2] * 5
df = pd.DataFrame(dict(port=['pf']*25,sector=sector,instrument=list('ABCDE')*5,date=date,data=xrange(25)))
df = pd.concat([df,pd.DataFrame(dict(port=['bm']*25,sector=sector,instrument=list('ABCDE')*5,date=date,data=xrange(25)))],axis=0)
df = df.set_index(['port','sector','instrument','date'])
df = df.unstack('port')

I want to get two sets of results: all the values on 2013-01-17 and all the values from 2013-01-17 to the end of the series.

For the first I know I can use one of the following approaches:

idx = pd.IndexSlice
targetdate = pd.Timestamp('2013-01-17')
slicer = (slice(None),slice(None),targetdate)

1) df.loc[slicer,:]

2) df.xs(pd.Timestamp('2013-01-17'),level=2)

3) df.xs(slicer,)

4) df[idx[:,:,targetdate],:]

all of which seem quite clunky. Is there a more obvious way I'm missing? What other ways are there to acheive this. I guess I'm hoping there is something like df.loc(level=2)[targetdate] (which doesn't work of course).

For the second I've only come up with one solution

query = df.index.get_level_values(2) >= pd.Timestamp('2013-01-17')
df[query]

Again is there a more efficient way to do this?

Final bonus question: what does df.index.get_loc_level() do? I feel like it should help with this but I have no idea how to use it.

Thanks

Upvotes: 3

Views: 2025

Answers (1)

Andy Hayden
Andy Hayden

Reputation: 375865

I think this masking, like you're doing, is going to be pretty good here:

query = df.index.get_level_values(2) >= pd.Timestamp('2013-01-17')
df[query]

if you have lots of repetition in dates you may improve performance with something lower-level:

query = (df.index.levels[2] >= pd.Timestamp("2013-01-17"))[df.index.labels[2]]
df[query]

I'll probably get shouted out for that...! But it will be significantly faster in some cases.


get_loc_level is the similar of loc, i.e. label based rather than by position:

Get integer location slice for requested label or tuple

In [21]: df.index.get_loc_level(2)
Out[21]:
(slice(15, 25, None),
 MultiIndex(levels=[[u'A', u'B', u'C', u'D', u'E'], [2013-01-14 00:00:00, 2013-01-15 00:00:00, 2013-01-16 00:00:00, 2013-01-17 00:00:00, 2013-01-18 00:00:00]],
            labels=[[3, 3, 3, 3, 3, 4, 4, 4, 4, 4], [0, 1, 2, 3, 4, 0, 1, 2, 3, 4]],
            names=[u'instrument', u'date']))

by default it takes the first index, but you can pass in more...

In [21]: df.index.get_loc_level((1, "A"))
Out[21]:
(slice(0, 5, None), <class 'pandas.tseries.index.DatetimeIndex'>
 [2013-01-14, ..., 2013-01-18]
 Length: 5, Freq: None, Timezone: None)

Upvotes: 5

Related Questions