cjm2671
cjm2671

Reputation: 19496

The right way to query a pandas MultiIndex

I've got a huge dataframe (13 million rows) which stocks and stock prices in. I've indexed them using MultiIndex(['stock', 'date']), where date has been parsed as a DateTime.

This means I can select price data based upon stock easily df.loc['AAPL'], and by specific day df.loc['AAPL'].loc['2015-05-05'].

What I want to know is, what is the fastest and most elegant accessor for getting all stock prices on a particular day (i.e. if the indices were reversed).

Currently I'm using df.xs("2015-05-05", level=1) - is that the right way? Is there a better/cleaner/faster way?

Upvotes: 11

Views: 9373

Answers (1)

JohnE
JohnE

Reputation: 30444

I think what you did is fine, but there are alternative ways also.

>>> df = pd.DataFrame({ 
              'stock':np.repeat( ['AAPL','GOOG','YHOO'], 3 ),
              'date':np.tile( pd.date_range('5/5/2015', periods=3, freq='D'), 3 ),
              'price':(np.random.randn(9).cumsum() + 10) })

>>> df = df.set_index(['stock','date'])

                      price
stock date                 
AAPL  2015-05-05   8.538459
      2015-05-06   9.330140
      2015-05-07   8.968898
GOOG  2015-05-05   8.964389
      2015-05-06   9.828230
      2015-05-07   9.992985
YHOO  2015-05-05   9.929548
      2015-05-06   9.330295
      2015-05-07  10.676468

A slightly more standard way of using loc twice

>>> df.loc['AAPL'].loc['2015-05-05']

would be to do

>>> df.loc['AAPL','2015-05-05']

price    8.538459
Name: (AAPL, 2015-05-05 00:00:00), dtype: float64

And instead of xs you could use an IndexSlice. I think for 2 levels xs is easier, but IndexSlice might be better past 2 levels.

>>> idx=pd.IndexSlice

>>> df.loc[ idx[:,'2015-05-05'], : ]

                     price
stock date                
AAPL  2015-05-05  8.538459
GOOG  2015-05-05  8.964389
YHOO  2015-05-05  9.929548

And to be honest, I think the absolute easiest way here is use either date or stock (or neither) as index and then most selections are very straightforward. For example, if you remove the index completely you can effortlessly select by date:

>>> df = df.reset_index()
>>> df[ df['date']=='2015-05-05' ]

   index stock       date      price
0      0  AAPL 2015-05-05   8.538459
3      3  GOOG 2015-05-05   8.964389
6      6  YHOO 2015-05-05   9.929548

Doing some quickie timings with 3 stocks and 3000 dates (=9000 rows), I found that a simple boolean selection (no index) was about 35% faster than xs, and xs was about 35% faster than using IndexSlice. But see Jeff's comment below, you should expect the boolean selection to perform relative worse with more rows.

Of course, the best thing for you to do is test on your own data and see how it comes out.

Upvotes: 6

Related Questions