Reputation: 81
I have a Pandas DataFrame like this one:
import numpy as np
import pandas as pd
np.random.seed(1234)
midx = pd.MultiIndex.from_product([['a', 'b', 'c'], pd.date_range('20130101', periods=6)], names=['letter', 'date'])
df = pd.DataFrame(np.random.randn(len(midx), 1), index=midx)
That dataframe looks like this:
                          0
letter date
a      2013-01-01  0.471435
       2013-01-02 -1.190976
       2013-01-03  1.432707
       2013-01-04 -0.312652
       2013-01-05 -0.720589
       2013-01-06  0.887163
b      2013-01-01  0.859588
       2013-01-02 -0.636524
       2013-01-03  0.015696
       2013-01-04 -2.242685
       2013-01-05  1.150036
       2013-01-06  0.991946
c      2013-01-01  0.953324
       2013-01-02 -2.021255
       2013-01-03 -0.334077
       2013-01-04  0.002118
       2013-01-05  0.405453
       2013-01-06  0.289092
What I want to do is keep rows based on a condition on date that depends on the letter: for instance, keep a range of dates for letters a and c but only a single date for letter b. All this information could be stored in a dictionary:
dictionary = {"a": slice("20130102", "20130105"),
"b": "20130103",
"c": slice("20130103", "20130105")}
Is there an easy way to compute this with pandas? I did not find any information about such filtering.
Upvotes: 7
Views: 2170
Reputation: 76917
You could use query; it is designed for exactly this kind of selection criterion.
If you slightly modify your dictionary, you can generate the desired query string with the help of a generator expression:
In : dictionary
Out:
{'a': ('20130102', '20130105'),
'b': ('20130103', '20130103'),
'c': ('20130103', '20130105')}
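If you would rather not rewrite the dictionary by hand, you could derive the (start, stop) tuples from the original slice/string values; a rough sketch (the to_bounds helper below is just an illustrative name, not part of the answer):
def to_bounds(value):
    # Turn a slice into (start, stop); a single date becomes a degenerate range.
    if isinstance(value, slice):
        return (value.start, value.stop)
    return (value, value)

original = {"a": slice("20130102", "20130105"),
            "b": "20130103",
            "c": slice("20130103", "20130105")}
dictionary = {k: to_bounds(v) for k, v in original.items()}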
In : df.query(
         ' or '.join("('{}' <= date <= '{}' and letter == '{}')".format(*(v + (k,)))
                     for k, v in dictionary.items())
     )
Out:
                          0
letter date
a      2013-01-02 -1.190976
       2013-01-03  1.432707
       2013-01-04 -0.312652
       2013-01-05 -0.720589
b      2013-01-03  0.015696
c      2013-01-03 -0.334077
       2013-01-04  0.002118
       2013-01-05  0.405453
For more information on what the query statement is actually doing, here is the string that the generator expression builds:
In : (' or '.join("('{}' <= date <= '{}' and letter == '{}')".format(*(v + (k,)))
                  for k, v in dictionary.items()))
Out: "('20130102' <= date <= '20130105' and letter == 'a') or
      ('20130103' <= date <= '20130105' and letter == 'c') or
      ('20130103' <= date <= '20130103' and letter == 'b')"
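For comparison, roughly the same selection can be written without query by building a boolean mask from the index levels; a sketch using the same tuple-valued dictionary (not part of the original answer):
import numpy as np

letters = df.index.get_level_values('letter')
dates = df.index.get_level_values('date')

# OR together one (letter, date-range) condition per dictionary entry
mask = np.zeros(len(df), dtype=bool)
for letter, (start, stop) in dictionary.items():
    mask |= ((letters == letter)
             & (dates >= pd.Timestamp(start))
             & (dates <= pd.Timestamp(stop)))

df[mask]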
Upvotes: 6
Reputation: 111
The easiest way is to apply a function to a pandas DataFrameGroupBy object. Here is an example:
dictionary = {"a": slice("20130102", "20130105"),
"b": slice("20130103", "20130103"),
"c": slice("20130103", "20130105")}
def date_condition(group, dictionary):
return group.xs(group.name).loc[dictionary[group.name]]
df.groupby(level=0).apply(date_condition, dictionary)
Output[0]:
                          0
letter date
a      2013-01-02 -1.190976
       2013-01-03  1.432707
       2013-01-04 -0.312652
       2013-01-05 -0.720589
b      2013-01-03  0.015696
c      2013-01-03 -0.334077
       2013-01-04  0.002118
       2013-01-05  0.405453
Note that the slice for "b" repeats the date so that .loc returns a DataFrame instead of a Series.
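To illustrate the point (a small sketch, not part of the answer): selecting with a single label drops a dimension, while a degenerate slice keeps it.
sub = df.xs('b')                        # DataFrame indexed by date only
sub.loc['20130103']                     # -> Series (dimension dropped)
sub.loc[slice('20130103', '20130103')]  # -> 1-row DataFrame (dimension kept)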
Upvotes: 1
Reputation: 4744
With a small change to the original dictionary, we can do this a little more succinctly. We can use pd.IndexSlice in a list comprehension and then pd.concat the pieces:
from collections import OrderedDict

# add `-` to separate the date components
dictionary = {"a": slice("2013-01-02", "2013-01-05"),
              "b": "2013-01-03",
              "c": slice("2013-01-03", "2013-01-05")}
dictionary = OrderedDict(sorted(dictionary.items()))
idx_slices = [pd.IndexSlice[k, v] for k, v in dictionary.items()]
pd.concat([df.loc[idx, :] for idx in idx_slices])
Out[1]:
                          0
letter date
a      2013-01-02 -1.190976
       2013-01-03  1.432707
       2013-01-04 -0.312652
       2013-01-05 -0.720589
c      2013-01-03 -0.334077
       2013-01-04  0.002118
       2013-01-05  0.405453
b      2013-01-03  0.015696
If you wished to automate adding the - separator, you could use datetime as follows:
import datetime as dt
dt.datetime.strptime('20170121', '%Y%m%d').strftime('%Y-%m-%d')
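Applied to the whole original dictionary, that conversion might look like the following sketch (add_dashes is a hypothetical helper name, not part of the answer):
import datetime as dt

def add_dashes(d):
    # '20130102' -> '2013-01-02'
    return dt.datetime.strptime(d, '%Y%m%d').strftime('%Y-%m-%d')

original = {"a": slice("20130102", "20130105"),
            "b": "20130103",
            "c": slice("20130103", "20130105")}
dictionary = {k: slice(add_dashes(v.start), add_dashes(v.stop))
                 if isinstance(v, slice) else add_dashes(v)
              for k, v in original.items()}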
Upvotes: 1
Reputation: 40878
This is a kludgy way to go about it, but you could use the fact that passing a list of labels or tuples works similarly to reindexing [source], and take advantage of pd.Index.slice_indexer(start, stop), which lets you narrow each date index down to the span between the specified dates.
>>> dictionary = {"a": ("20130102", "20130105"),
...               "b": "20130103",
...               "c": ("20130103", "20130105")}
...
...
... def get_idx_pairs():
...     for lvl0, lvl1 in df.index.groupby(df.index.get_level_values(0)).items():
...         dates = lvl1.levels[1]
...         dt = dictionary[lvl0]
...         if isinstance(dt, (tuple, list)):
...             slices = dates[dates.slice_indexer(dt[0], dt[1])]
...             for s in slices:
...                 yield (lvl0, s)
...         else:
...             yield (lvl0, dt)
...
...
... df.loc[list(get_idx_pairs())]
...
                        0
letter date
a      2013-01-02 -1.1910
       2013-01-03  1.4327
       2013-01-04 -0.3127
       2013-01-05 -0.7206
b      2013-01-03  0.0157
c      2013-01-03 -0.3341
       2013-01-04  0.0021
       2013-01-05  0.4055
For each "smaller" DatetimeIndex in date
, you constrain that to the slices specified, then construct tuples of (letter, date)
on which to index explicitly.
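As a small illustration of those two building blocks (a sketch, not taken from the answer's session):
dates = pd.date_range('20130101', periods=6)
dates.slice_indexer('20130102', '20130105')          # slice(1, 5, None)
dates[dates.slice_indexer('20130102', '20130105')]   # the four selected dates

# indexing the MultiIndex with explicit (letter, date) tuples
df.loc[[('a', pd.Timestamp('2013-01-02')), ('b', pd.Timestamp('2013-01-03'))]]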
Alternatively, if you're okay with specifying the dates as tuples (for single dates, just repeat), you can condense the helper function a bit:
>>> dates = (("20130102", "20130105"),
...          ("20130103", "20130103"),
...          ("20130103", "20130105"))
...
... def get_idx_pairs(df, dates):
...     letters = df.index.get_level_values(0)
...     for (k, v), (start, stop) in zip(df.index.groupby(letters).items(), dates):
...         group_dates = v.levels[1]  # the date level for this letter's group
...         sliced = group_dates[group_dates.slice_indexer(start, stop)]
...         for s in sliced:
...             yield k, s
...
... df.loc[list(get_idx_pairs(df, dates))]
...
                        0
letter date
a      2013-01-02 -1.1910
       2013-01-03  1.4327
       2013-01-04 -0.3127
       2013-01-05 -0.7206
b      2013-01-03  0.0157
c      2013-01-03 -0.3341
       2013-01-04  0.0021
       2013-01-05  0.4055
Upvotes: 2