Mattijn
Mattijn

Reputation: 13930

Subset Pandas DataFrame based on annual returning period covering multiple months

This question is similar to Selecting Pandas DataFrame records for many years based on month & day range, but both the question and answer doesn't seem to cover my case

import pandas as pd
import numpy as np

rng = pd.date_range('2010-1-1', periods=1000, freq='D')
df = pd.DataFrame(np.random.randn(len(rng)), index=rng, columns=['A'])
df.head()

                   A
2010-01-01  1.098302
2010-01-02 -1.384821
2010-01-03 -0.426329
2010-01-04 -0.587967
2010-01-05 -0.853374

Now I would like to subset my DataFrame based on an annual returning period for every year. A period can for example be defined as from February 15th to October 3rd

startMM, startdd = (2,15)
endMM, enddd = (10,3)

Now I tried to to slice my multi-year DataFrame based on this period:

subset = df[((df.index.month == startMM) & (startdd <= df.index.day) 
             | (df.index.month == endMM) & (df.index.day <= enddd))]

but this returns only the months as is defined in the startMM and endMM and not the actual period between the dates. Any help is kindly appreciated.

subset.index.month.unique()

Int64Index([2, 10], dtype='int64')

Upvotes: 3

Views: 505

Answers (2)

MaxU - stand with Ukraine
MaxU - stand with Ukraine

Reputation: 210982

Alternative solution:

In [79]: x = df.assign(x=df.index.strftime('%m-%d')) \
               .query("'02-15' <= x <= '10-03'").drop('x',1)

In [80]: x
Out[80]:
                   A
2010-02-15 -1.004663
2010-02-16  0.683352
2010-02-17  0.158518
2010-02-18 -0.447414
2010-02-19  0.078998
...              ...
2012-09-22  1.378253
2012-09-23  1.215885
2012-09-24  0.203096
2012-09-25 -1.666974
2012-09-26  0.231987

[687 rows x 1 columns]

In [81]: x.index.month.unique()
Out[81]: Int64Index([2, 3, 4, 5, 6, 7, 8, 9, 10], dtype='int64')

Upvotes: 1

IanS
IanS

Reputation: 16271

I would create a column of (month, day) tuples:

month_day = pd.concat([
                df.index.to_series().dt.month, 
                df.index.to_series().dt.day
            ], axis=1).apply(tuple, axis=1)

You can then compare them directly:

df[(month_day >= (startMM, startdd)) & (month_day <= (endMM, enddd))]

Upvotes: 3

Related Questions