MTANG

Reputation: 516

pandas: get data for the last day of each month?

The data is given as follows:

             return 
2010-01-04  0.016676    
2010-01-05  0.003839
...
2010-01-05  0.003839
2010-01-29  0.001248
2010-02-01  0.000134
...

What I want is to extract the rows for the last day of each month that actually appears in the data.

2010-01-29  0.00134
2010-02-28  ......

If I use pandas resample directly, i.e. df.resample('M').last(), I get the correct rows but with the wrong index: it automatically uses the calendar month end as the index label.

2010-01-31  0.00134
2010-02-28  ......

How can I get the correct answer in a Pythonic way?

Upvotes: 9

Views: 8272

Answers (2)

cs95

Reputation: 403218

An assumption made here is that your date data is part of the index. If not, I recommend setting it first.

Single Year

I don't think the resampling or grouper functions would do. Let's group on the month number instead and call DataFrameGroupBy.tail.

df.groupby(df.index.month).tail(1) 

Multiple Years

If your data spans multiple years, you'll need to group on both the year and the month. Using a single grouper created from the index's strftime:

df.groupby(df.index.strftime('%Y-%m')).tail(1)

Or, using multiple groupers:

df.groupby([df.index.year, df.index.month]).tail(1)

Note: if your index is not a DatetimeIndex as assumed here, you'll need to replace df.index with pd.to_datetime(df.index, errors='coerce') above.
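
As a minimal sketch of the multi-year case, the snippet below builds a small hypothetical frame (the first four values come from the question; the remaining dates and values are made up for illustration) and compares grouping on year and month with grouping on month alone. It assumes the index is sorted by date, since tail(1) simply keeps the last row of each group in its existing order.

```python
import pandas as pd

# Hypothetical sample: the dates and values after 2010-02-01 are invented
idx = pd.to_datetime([
    "2010-01-04", "2010-01-05", "2010-01-29",
    "2010-02-01", "2010-02-26",
    "2011-01-03", "2011-01-31",
])
df = pd.DataFrame(
    {"return": [0.016676, 0.003839, 0.001248, 0.000134, 0.002, 0.001, 0.003]},
    index=idx,
)

# Group on (year, month) and keep the last row of each group;
# the original index labels are preserved
last_rows = df.groupby([df.index.year, df.index.month]).tail(1)
print(last_rows)

# Grouping on the month alone merges January 2010 with January 2011,
# so the 2010-01-29 row is lost
month_only = df.groupby(df.index.month).tail(1)
print(month_only)
```

Here last_rows keeps 2010-01-29, 2010-02-26 and 2011-01-31, while month_only keeps only 2010-02-26 and 2011-01-31.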

Upvotes: 15

Anton vBR

Reputation: 18914

Although this doesn't answer the question properly, I'll leave it here in case someone is interested.

An approach that only works if you are certain every calendar day is present (IMPORTANT!) is to add one day with pd.Timedelta and check whether the resulting day is 1. In a small timing test I ran, it was about 6x faster than the groupby solution.

df[(df['dates'] + pd.Timedelta(days=1)).dt.day == 1]

Or if index:

df[(df.index + pd.Timedelta(days=1)).day == 1]

Full example:

import pandas as pd

df = pd.DataFrame({
    'dates': pd.date_range(start='2016-01-01', end='2017-12-31'),
    'i': 1
}).set_index('dates')

dfout = df[(df.index + pd.Timedelta(days=1)).day == 1]
print(dfout)

Returns:

           i
dates        
2016-01-31  1
2016-02-29  1
2016-03-31  1
2016-04-30  1
2016-05-31  1
2016-06-30  1
2016-07-31  1
2016-08-31  1
2016-09-30  1
2016-10-31  1
2016-11-30  1
2016-12-31  1
2017-01-31  1
2017-02-28  1
2017-03-31  1
2017-04-30  1
2017-05-31  1
2017-06-30  1
2017-07-31  1
2017-08-31  1
2017-09-30  1
2017-10-31  1
2017-11-30  1
2017-12-31  1
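
To illustrate the caveat about missing days, here is a small sketch on a hypothetical business-day index, similar in shape to the data in the question. In early 2010 both month ends fall on a weekend, so the day-shift test finds nothing, whereas the groupby approach from the other answer still returns the last available day of each month.

```python
import pandas as pd

# Hypothetical business-day index: 2010-01-31 and 2010-02-28 are Sundays,
# so neither calendar month end is present
idx = pd.bdate_range("2010-01-04", "2010-02-26")
df = pd.DataFrame({"i": 1}, index=idx)

# Day-shift test: keeps a row only if the next calendar day is the 1st
shifted = df[(df.index + pd.Timedelta(days=1)).day == 1]
print(shifted)  # empty for this index

# groupby/tail keeps the last row that is actually present in each month
grouped = df.groupby([df.index.year, df.index.month]).tail(1)
print(grouped)
```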

Upvotes: 6
