pbreach
pbreach

Reputation: 17017

Convenient way to slice from multiindexed rows based on range of keys at different levels with pandas dataframe

I have a multiindexed pandas dataframe sort of like this:

data = np.random.random((1800,9))
col = pd.MultiIndex.from_product([('A','B','C'),('a','b','c')])

year = range(2006,2011)
month = range(1,13)
day = range(1,31)

idx = pd.MultiIndex.from_product([year,month,day], names=['Year','Month','Day'])

df1 = pd.DataFrame(data, idx, col)

Which has multiindexed rows of Year, Month, Day. I want to be able to select rows from this Dataframe as if it were one that has a DatetimeIndex.

The equivalent DataFrame with a DatetimeIndex would be:

idx = pd.DatetimeIndex(start='2006-01-01', end='2010-12-31', freq='d')
timeidx = [ix for ix in idx if ix.day < 29]
df2 = pd.DataFrame(data, timeidx, col)

What I would like is this:

all(df2.ix['2006-06-06':'2008-10-11'] == df1'insert expression here')

to equal True

I know I can select cross-sections via df1.xs('2006', level='Year'), but I basically need an easy way to replicate what was done for df2 as I am forced to use this index as opposed to the DatetimeIndex.

Upvotes: 1

Views: 111

Answers (1)

Andy Hayden
Andy Hayden

Reputation: 375865

One issue you'll immediately have by storing these as strings is '2' > '10', which is almost certainly not what you want, so I recommend using ints. That is:

year = range(2006,2011)
month = range(1,13)
day = range(1,31)

I though that you ought to be able to use pd.IndexSlice here, my first thought was to use it as follows:

In [11]: idx = pd.IndexSlice

In [12]: df1.loc[idx[2006:2008, 6:10, 6:11], :]
...

but this shows those between 2006-8 and june-oct and 6-11th (ie 3*5*6 = 90 days).


So here's a non-vectorized way, just compare the tuples:

In [21]: df1.index.map(lambda x: (2006, 6, 6) < x < (2008, 10, 11))
Out[21]: array([False, False, False, ..., False, False, False], dtype=bool)

In [22]: df1[df1.index.map(lambda x: (2006, 6, 6) < x < (2008, 10, 11))]
# just the (844) rows you want

If this was unbearably slow, a trick (to vectorize) would be to use some float representation, for example:

In [31]: df1.index.get_level_values(0).values + df1.index.get_level_values(1).values * 1e-3 + df1.index.get_level_values(2).values * 1e-6
Out[31]:
array([ 2006.001001,  2006.001002,  2006.001003, ...,  2010.012028,
        2010.012029,  2010.01203 ])

Upvotes: 1

Related Questions