Reputation: 81

Filtering with MultiIndex

I have a Pandas DataFrame like this one:

import numpy as np
import pandas as pd

np.random.seed(1234)
midx = pd.MultiIndex.from_product([['a', 'b', 'c'], pd.date_range('20130101', periods=6)], names=['letter', 'date'])
df = pd.DataFrame(np.random.randn(len(midx), 1), index=midx)

That dataframe looks like this:

                        0
letter      date    
  a     2013-01-01  0.471435
        2013-01-02  -1.190976
        2013-01-03  1.432707
        2013-01-04  -0.312652
        2013-01-05  -0.720589
        2013-01-06  0.887163
  b     2013-01-01  0.859588
        2013-01-02  -0.636524
        2013-01-03  0.015696
        2013-01-04  -2.242685
        2013-01-05  1.150036
        2013-01-06  0.991946
  c     2013-01-01  0.953324
        2013-01-02  -2.021255
        2013-01-03  -0.334077
        2013-01-04  0.002118
        2013-01-05  0.405453
        2013-01-06  0.289092

I want to keep only the rows that satisfy a condition on date, where the condition depends on the letter. For instance:

for letter a, I want to keep all rows such that date is between "20130102" and "20130105" (inclusive)
for letter b, I want to keep all rows such that date == "20130103"
for letter c, I want to keep all rows such that date is between "20130103" and "20130105" (inclusive)

All of this information could be stored in a dictionary, for instance:

dictionary = {"a": slice("20130102", "20130105"),
              "b": "20130103",
              "c": slice("20130103", "20130105")}

Is there an easy way to compute this with pandas? I did not find any information about such filtering.

Upvotes: 7

Answers (4)

Zero

Reputation: 77027

You could use query, which is designed for this kind of selection criterion.

If you slightly modify your dictionary, you can generate the desired query string with a generator expression:

In : dictionary
Out:
{'a': ('20130102', '20130105'),
 'b': ('20130103', '20130103'),
 'c': ('20130103', '20130105')}

In : df.query(
          ' or '.join("('{}' <= date <= '{}' and letter == '{}')".format(*(v + (k,))) 
          for k, v in dictionary.items())
         )
Out:
                          0
letter date
a      2013-01-02 -1.190976
       2013-01-03  1.432707
       2013-01-04 -0.312652
       2013-01-05 -0.720589
b      2013-01-03  0.015696
c      2013-01-03 -0.334077
       2013-01-04  0.002118
       2013-01-05  0.405453

For more information on what the query statement is actually doing, here is the string that the generator expression produces:

In : (' or '.join("('{}' <= date <= '{}' and letter == '{}')".format(*(v + (k,)))
          for k, v in dictionary.items()))
Out: "('20130102' <= date <= '20130105' and letter == 'a') or 
          ('20130103' <= date <= '20130105' and letter == 'c') or
          ('20130103' <= date <= '20130103' and letter == 'b')"
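
The "slightly modify your dictionary" step can be sketched as follows. This is a minimal sketch, not part of the original answer: the helper name to_bounds and the variable names original, bounds and expr are hypothetical. It assumes every value in the question's dictionary is either a slice of two date strings or a single date string.

```python
# Hypothetical helper: turn the question's original dictionary values
# (a slice or a single date string) into (start, stop) tuples.
def to_bounds(value):
    if isinstance(value, slice):
        return (value.start, value.stop)
    return (value, value)  # a single date becomes a one-day range

original = {"a": slice("20130102", "20130105"),
            "b": "20130103",
            "c": slice("20130103", "20130105")}

bounds = {letter: to_bounds(value) for letter, value in original.items()}

# Build one clause per letter and join the clauses with "or".
expr = " or ".join(
    "('{}' <= date <= '{}' and letter == '{}')".format(start, stop, letter)
    for letter, (start, stop) in bounds.items()
)
print(expr)
```

The resulting string has the same form as the one shown above and could be passed to df.query in the same way.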

Upvotes: 6

Gustavo Vera Velasco

Reputation: 111

The easiest way is to apply a function to a pandas DataFrameGroupBy object. Here is an example:

dictionary = {"a": slice("20130102", "20130105"),
              "b": slice("20130103", "20130103"),
              "c": slice("20130103", "20130105")}

def date_condition(group, dictionary):
    return group.xs(group.name).loc[dictionary[group.name]]

df.groupby(level=0).apply(date_condition, dictionary)

Output[0]:
                          0
letter date                
a      2013-01-02 -1.190976
       2013-01-03  1.432707
       2013-01-04 -0.312652
       2013-01-05 -0.720589
b      2013-01-03  0.015696
c      2013-01-03 -0.334077
       2013-01-04  0.002118
       2013-01-05  0.405453

Note that the slice for "b" repeats the date so that .loc returns a DataFrame instead of a Series.
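
To illustrate the difference between a single date label and a one-day slice, here is a small sketch. The frame sub is a hypothetical stand-in for the rows of one letter, indexed by date only; it uses placeholder values rather than the question's random data.

```python
import numpy as np
import pandas as pd

# Stand-in for one group's rows after the letter level has been dropped.
dates = pd.date_range("20130101", periods=6)
sub = pd.DataFrame(np.arange(6.0), index=dates)

single = sub.loc["20130103"]                     # exact label match on a daily index
sliced = sub.loc[slice("20130103", "20130103")]  # slice that covers one day

print(type(single).__name__)
print(type(sliced).__name__)
```

Because the sliced form keeps the date index, the groups can be recombined into a frame that still has both index levels.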

Upvotes: 1

Little Bobby Tables

Reputation: 4742

With a small change to the original dictionary we can do this a little more succinctly. We can use pd.IndexSlice in a list comprehension and then pd.concat:

from collections import OrderedDict

# add `-` to separate the date components
dictionary = {"a": slice("2013-01-02", "2013-01-05"),
              "b": "2013-01-03",
              "c": slice("2013-01-03", "2013-01-05")}

dictionary = OrderedDict(sorted(dictionary.items()))

idx_slices = [pd.IndexSlice[k, v] for k, v in dictionary.items()]

pd.concat([df.loc[idx, :] for idx in idx_slices])

Out[1]:
                     0
letter  date    
a       2013-01-02   -1.190976
        2013-01-03   1.432707
        2013-01-04   -0.312652
        2013-01-05   -0.720589
c       2013-01-03   -0.334077
        2013-01-04   0.002118
        2013-01-05   0.405453
b       2013-01-03   0.015696

If you want to automate adding the dashes, you could use datetime as follows:

import datetime as dt

dt.datetime.strptime('20170121', '%Y%m%d').strftime('%Y-%m-%d')
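
Applied to the whole dictionary, that conversion might look like the sketch below. The helper names add_dashes and normalize are hypothetical, and the sketch assumes each value is either a slice of two "YYYYMMDD" strings or a single such string.

```python
import datetime as dt

def add_dashes(date_str):
    # "20130102" -> "2013-01-02"
    return dt.datetime.strptime(date_str, "%Y%m%d").strftime("%Y-%m-%d")

def normalize(value):
    # Convert both ends of a slice, or a single date string.
    if isinstance(value, slice):
        return slice(add_dashes(value.start), add_dashes(value.stop))
    return add_dashes(value)

original = {"a": slice("20130102", "20130105"),
            "b": "20130103",
            "c": slice("20130103", "20130105")}

dictionary = {letter: normalize(value) for letter, value in original.items()}
print(dictionary)
```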

Upvotes: 1

Brad Solomon

Reputation: 40938

This is a kludgy way to go about this, but you could use the fact that

passing a list of labels or tuples works similar to reindexing [source]

and take advantage of pd.Index.slice_indexer(start, stop), which lets you restrict each index to the range between the specified dates.

>>> dictionary = {"a": ("20130102", "20130105"),
...               "b": "20130103",
...               "c": ("20130103", "20130105")}
... 
... 
... def get_idx_pairs():
...     for lvl0, lvl1 in df.index.groupby(df.index.get_level_values(0)).items():
...         dates = lvl1.levels[1]
...         dt = dictionary[lvl0]
...         if isinstance(dt, (tuple, list)):
...             slices = dates[dates.slice_indexer(dt[0], dt[1])]
...             for s in slices:
...                 yield (lvl0, s)
...         else:
...             yield (lvl0, dt)
... 
... 
... df.loc[list(get_idx_pairs())]
... 
                        0
letter date              
a      2013-01-02 -1.1910
       2013-01-03  1.4327
       2013-01-04 -0.3127
       2013-01-05 -0.7206
b      2013-01-03  0.0157
c      2013-01-03 -0.3341
       2013-01-04  0.0021
       2013-01-05  0.4055

For each "smaller" DatetimeIndex in date, you constrain it to the specified slice, then construct tuples of (letter, date) on which to index explicitly.
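
As a rough illustration of that one step in isolation, the sketch below applies slice_indexer to a standalone DatetimeIndex and builds the (letter, date) tuples for a single letter. The variable names indexer, selected and pairs are chosen here for illustration only.

```python
import pandas as pd

# The same six dates used for the "date" level in the question.
dates = pd.date_range("20130101", periods=6)

# slice_indexer turns the label bounds into a positional slice (both ends inclusive).
indexer = dates.slice_indexer("20130102", "20130105")
selected = dates[indexer]

# Pair each selected date with its letter to get explicit MultiIndex keys.
pairs = [("a", d) for d in selected]
print(pairs)
```

A list of such tuples, collected for every letter, is what gets passed to df.loc in the answer.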

Alternatively, if you're okay with specifying the dates as tuples (for single dates, just repeat), you can condense the helper function a bit:

>>> dates = (("20130102", "20130105"),
...          ("20130103", "20130103"),
...          ("20130103", "20130105"))
... 
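... def get_idx_pairs(df, dates):
...     letters = df.index.get_level_values(0)
...     # Assumes the groups iterate in the same order as the entries in `dates`.
...     for (k, v), (start, stop) in zip(df.index.groupby(letters).items(), dates):
...         level_dates = v.levels[1]  # avoid shadowing the `dates` argument
...         sliced = level_dates[level_dates.slice_indexer(start, stop)]
...         for s in sliced:
...             yield k, s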
... 
... df.loc[list(get_idx_pairs(df, dates))]
... 
                        0
letter date              
a      2013-01-02 -1.1910
       2013-01-03  1.4327
       2013-01-04 -0.3127
       2013-01-05 -0.7206
b      2013-01-03  0.0157
c      2013-01-03 -0.3341
       2013-01-04  0.0021
       2013-01-05  0.4055

Upvotes: 2
