Reputation: 17017
I have a multiindexed pandas dataframe sort of like this:
data = np.random.random((1800,9))
col = pd.MultiIndex.from_product([('A','B','C'),('a','b','c')])
year = range(2006,2011)
month = range(1,13)
day = range(1,31)
idx = pd.MultiIndex.from_product([year,month,day], names=['Year','Month','Day'])
df1 = pd.DataFrame(data, idx, col)
Which has multiindexed rows of Year
, Month
, Day
. I want to be able to select rows from this Dataframe as if it were one that has a DatetimeIndex.
The equivalent DataFrame with a DatetimeIndex would be:
idx = pd.DatetimeIndex(start='2006-01-01', end='2010-12-31', freq='d')
timeidx = [ix for ix in idx if ix.day < 29]
df2 = pd.DataFrame(data, timeidx, col)
What I would like is this:
all(df2.ix['2006-06-06':'2008-10-11'] == df1'insert expression here')
to equal True
I know I can select cross-sections via df1.xs('2006', level='Year')
, but I basically need an easy way to replicate what was done for df2
as I am forced to use this index as opposed to the DatetimeIndex.
Upvotes: 1
Views: 111
Reputation: 375865
One issue you'll immediately have by storing these as strings is '2' > '10'
, which is almost certainly not what you want, so I recommend using ints. That is:
year = range(2006,2011)
month = range(1,13)
day = range(1,31)
I though that you ought to be able to use pd.IndexSlice here, my first thought was to use it as follows:
In [11]: idx = pd.IndexSlice
In [12]: df1.loc[idx[2006:2008, 6:10, 6:11], :]
...
but this shows those between 2006-8 and june-oct and 6-11th (ie 3*5*6 = 90 days).
So here's a non-vectorized way, just compare the tuples:
In [21]: df1.index.map(lambda x: (2006, 6, 6) < x < (2008, 10, 11))
Out[21]: array([False, False, False, ..., False, False, False], dtype=bool)
In [22]: df1[df1.index.map(lambda x: (2006, 6, 6) < x < (2008, 10, 11))]
# just the (844) rows you want
If this was unbearably slow, a trick (to vectorize) would be to use some float representation, for example:
In [31]: df1.index.get_level_values(0).values + df1.index.get_level_values(1).values * 1e-3 + df1.index.get_level_values(2).values * 1e-6
Out[31]:
array([ 2006.001001, 2006.001002, 2006.001003, ..., 2010.012028,
2010.012029, 2010.01203 ])
Upvotes: 1