johann.faouzi

Reputation: 81

Filtering with MultiIndex

I have a Pandas DataFrame like this one:

import numpy as np
import pandas as pd

np.random.seed(1234)
midx = pd.MultiIndex.from_product([['a', 'b', 'c'], pd.date_range('20130101', periods=6)], names=['letter', 'date'])
df = pd.DataFrame(np.random.randn(len(midx), 1), index=midx)

That dataframe looks like this:

                          0
letter date                
a      2013-01-01  0.471435
       2013-01-02 -1.190976
       2013-01-03  1.432707
       2013-01-04 -0.312652
       2013-01-05 -0.720589
       2013-01-06  0.887163
b      2013-01-01  0.859588
       2013-01-02 -0.636524
       2013-01-03  0.015696
       2013-01-04 -2.242685
       2013-01-05  1.150036
       2013-01-06  0.991946
c      2013-01-01  0.953324
       2013-01-02 -2.021255
       2013-01-03 -0.334077
       2013-01-04  0.002118
       2013-01-05  0.405453
       2013-01-06  0.289092

What I want to do is keep only the rows whose date satisfies a condition that depends on the letter. For instance, the conditions could be stored in a dictionary:

dictionary = {"a": slice("20130102", "20130105"),
              "b": "20130103",
              "c": slice("20130103", "20130105")}

Is there an easy way to compute this with pandas? I did not find any information about such filtering.

Upvotes: 7

Views: 2170

Answers (4)

Zero

Reputation: 76917

You could use query, which is designed for this kind of selection criteria.

If you slightly modify your dictionary so that every value is a (start, stop) tuple, you can generate the desired query string with a generator expression:

In : dictionary
Out:
{'a': ('20130102', '20130105'),
 'b': ('20130103', '20130103'),
 'c': ('20130103', '20130105')}

In : df.query(
          ' or '.join("('{}' <= date <= '{}' and letter == '{}')".format(*(v + (k,))) 
          for k, v in dictionary.items())
         )
Out:
                          0
letter date
a      2013-01-02 -1.190976
       2013-01-03  1.432707
       2013-01-04 -0.312652
       2013-01-05 -0.720589
b      2013-01-03  0.015696
c      2013-01-03 -0.334077
       2013-01-04  0.002118
       2013-01-05  0.405453

To see what the query statement is actually doing, here is the string that the generator expression builds:

In : (' or '.join("('{}' <= date <= '{}' and letter == '{}')".format(*(v + (k,)))
          for k, v in dictionary.items()))
Out: "('20130102' <= date <= '20130105' and letter == 'a') or 
          ('20130103' <= date <= '20130105' and letter == 'c') or
          ('20130103' <= date <= '20130103' and letter == 'b')"
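If you would rather keep the dictionary from the question as it is, the modification can be automated. This is only a minimal sketch: the helper name to_bounds is made up for illustration, and it assumes every value is either a slice of two date strings or a single date string.

```python
def to_bounds(conditions):
    """Turn every value into a (start, stop) tuple of date strings.

    Hypothetical helper: assumes each value is either a slice with
    string bounds or a single date string.
    """
    bounds = {}
    for letter, cond in conditions.items():
        if isinstance(cond, slice):
            bounds[letter] = (cond.start, cond.stop)
        else:
            # A single date becomes a one-day range.
            bounds[letter] = (cond, cond)
    return bounds


original = {"a": slice("20130102", "20130105"),
            "b": "20130103",
            "c": slice("20130103", "20130105")}

bounds = to_bounds(original)

# Same query string as above, built from the converted dictionary.
query = " or ".join(
    "('{}' <= date <= '{}' and letter == '{}')".format(start, stop, letter)
    for letter, (start, stop) in bounds.items()
)
print(query)
```

The resulting string can then be passed to df.query(query) as shown above.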

Upvotes: 6

Gustavo Vera Velasco

Reputation: 111

The easiest way is to apply a function to a pandas DataFrameGroupBy object. Here is an example:

dictionary = {"a": slice("20130102", "20130105"),
              "b": slice("20130103", "20130103"),
              "c": slice("20130103", "20130105")}

def date_condition(group, dictionary):
    return group.xs(group.name).loc[dictionary[group.name]]

df.groupby(level=0).apply(date_condition, dictionary)

Output[0]:
                          0
letter date                
a      2013-01-02 -1.190976
       2013-01-03  1.432707
       2013-01-04 -0.312652
       2013-01-05 -0.720589
b      2013-01-03  0.015696
c      2013-01-03 -0.334077
       2013-01-04  0.002118
       2013-01-05  0.405453

Note that the slice for "b" repeats the date so that .loc returns a DataFrame instead of a Series.
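The difference between the two forms can be checked on a small frame. This is a sketch on a toy DataFrame (the name sub and its values are made up for illustration), assuming a daily DatetimeIndex like the one in the question:

```python
import pandas as pd

# Toy frame with a daily DatetimeIndex, similar to one letter's sub-frame.
dates = pd.date_range("20130101", periods=6)
sub = pd.DataFrame({0: range(6)}, index=dates)

# A single day-level string on a day-level index is treated as an exact label.
single = sub.loc["20130103"]

# A slice, even one that starts and stops on the same day, keeps the frame shape.
ranged = sub.loc[slice("20130103", "20130103")]

print(type(single).__name__)
print(type(ranged).__name__, ranged.shape)
```

Keeping every group as a DataFrame is what lets groupby(...).apply stitch the pieces back together with the (letter, date) index.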

Upvotes: 1

Little Bobby Tables

Reputation: 4744

With a small change to the original dictionary we can do this a little more succinctly. We can use pd.IndexSlice in a list comprehension and then pd.concat:

# add `-` to separate dates
dictionary = {"a": slice("2013-01-02", "2013-01-05"),
              "b": "2013-01-03",
              "c": slice("2013-01-03", "2013-01-05")}

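# OrderedDict lives in the standard library's collections module
from collections import OrderedDict

dictionary = OrderedDict(sorted(dictionary.items()))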

idx_slices = [pd.IndexSlice[k, v] for k, v in dictionary.items()]

pd.concat([df.loc[idx, :] for idx in idx_slices])

Out[1]:
                     0
letter  date    
a       2013-01-02   -1.190976
        2013-01-03   1.432707
        2013-01-04   -0.312652
        2013-01-05   -0.720589
c       2013-01-03   -0.334077
        2013-01-04   0.002118
        2013-01-05   0.405453
b       2013-01-03   0.015696

If you want to automate adding the - separators, you could use datetime as follows:

import datetime as dt

dt.datetime.strptime('20170121', '%Y%m%d').strftime('%Y-%m-%d')
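To apply that conversion to the whole dictionary from the question, something along these lines should do. It is a sketch only: add_dashes and normalize are hypothetical helper names, and it assumes every value is either a slice of two YYYYMMDD strings or a single YYYYMMDD string.

```python
import datetime as dt


def add_dashes(date_string):
    """Convert 'YYYYMMDD' to 'YYYY-MM-DD' (hypothetical helper)."""
    return dt.datetime.strptime(date_string, "%Y%m%d").strftime("%Y-%m-%d")


def normalize(conditions):
    """Rewrite every date in the dictionary, keeping slices as slices."""
    out = {}
    for letter, cond in conditions.items():
        if isinstance(cond, slice):
            out[letter] = slice(add_dashes(cond.start), add_dashes(cond.stop))
        else:
            out[letter] = add_dashes(cond)
    return out


original = {"a": slice("20130102", "20130105"),
            "b": "20130103",
            "c": slice("20130103", "20130105")}

normalized = normalize(original)
print(normalized)
```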

Upvotes: 1

Brad Solomon

Reputation: 40878

This is a kludgy way to go about it, but you could use the fact that

passing a list of labels or tuples works similar to reindexing [source]

and take advantage of pd.Index.slice_indexer(start, stop), which lets you filter each index to between the specified dates.

>>> dictionary = {"a": ("20130102", "20130105"),
...               "b": "20130103",
...               "c": ("20130103", "20130105")}
... 
... 
... def get_idx_pairs():
...     for lvl0, lvl1 in df.index.groupby(df.index.get_level_values(0)).items():
...         dates = lvl1.levels[1]
...         dt = dictionary[lvl0]
...         if isinstance(dt, (tuple, list)):
...             slices = dates[dates.slice_indexer(dt[0], dt[1])]
...             for s in slices:
...                 yield (lvl0, s)
...         else:
...             yield (lvl0, dt)
... 
... 
... df.loc[list(get_idx_pairs())]
... 
                        0
letter date              
a      2013-01-02 -1.1910
       2013-01-03  1.4327
       2013-01-04 -0.3127
       2013-01-05 -0.7206
b      2013-01-03  0.0157
c      2013-01-03 -0.3341
       2013-01-04  0.0021
       2013-01-05  0.4055

For each "smaller" DatetimeIndex in date, you constrain that to the slices specified, then construct tuples of (letter, date) on which to index explicitly.
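To see what slice_indexer does on its own, here is a minimal sketch on a standalone DatetimeIndex. The letter "a" and the variable names are just for illustration:

```python
import pandas as pd

dates = pd.date_range("20130101", periods=6)

# slice_indexer turns the two labels into a positional slice over the index.
positions = dates.slice_indexer("20130102", "20130105")

# Indexing with that slice keeps only the dates between the two labels,
# with both endpoints included.
selected = dates[positions]

# Pair each remaining date with its letter to get explicit index tuples.
pairs = [("a", d) for d in selected]

print(positions)
print(pairs)
```

A list of tuples like pairs is the kind of input df.loc receives in the snippet above.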

Alternatively, if you're okay with specifying the dates as tuples (for single dates, just repeat), you can condense the helper function a bit:

>>> dates = (("20130102", "20130105"),
...          ("20130103", "20130103"),
...          ("20130103", "20130105"))
... 
... def get_idx_pairs(df, dates):
...     letters = df.index.get_level_values(0)
...     for (k, v), (start, stop) in zip(df.index.groupby(letters).items(), dates):
...         level_dates = v.levels[1]  # separate name, so the `dates` argument is not shadowed
...         sliced = level_dates[level_dates.slice_indexer(start, stop)]
...         for s in sliced:
...             yield k, s
... 
... df.loc[list(get_idx_pairs(df, dates))]
... 
                        0
letter date              
a      2013-01-02 -1.1910
       2013-01-03  1.4327
       2013-01-04 -0.3127
       2013-01-05 -0.7206
b      2013-01-03  0.0157
c      2013-01-03 -0.3341
       2013-01-04  0.0021
       2013-01-05  0.4055

Upvotes: 2
