Reputation: 40889
I have a Pandas dataframe with a DataTimeIndex and some other columns, similar to this:
import pandas as pd
import numpy as np
range = pd.date_range('2017-12-01', '2018-01-05', freq='6H')
df = pd.DataFrame(index = range)
# Average speed in miles per hour
df['value'] = np.random.randint(low=0, high=60, size=len(df.index))
df.info()
# DatetimeIndex: 141 entries, 2017-12-01 00:00:00 to 2018-01-05 00:00:00
# Freq: 6H
# Data columns (total 1 columns):
# value 141 non-null int64
# dtypes: int64(1)
# memory usage: 2.2 KB
df.head(10)
# value
# 2017-12-01 00:00:00 15
# 2017-12-01 06:00:00 54
# 2017-12-01 12:00:00 19
# 2017-12-01 18:00:00 13
# 2017-12-02 00:00:00 35
# 2017-12-02 06:00:00 31
# 2017-12-02 12:00:00 58
# 2017-12-02 18:00:00 6
# 2017-12-03 00:00:00 8
# 2017-12-03 06:00:00 30
How can I select or filter the entries that are:
Weekdays only (that is, not weekend days Saturday or Sunday)
Not within N days of the dates in a list (e.g. U.S. holidays like '12-25' or '01-01')?
I was hoping for something like:
df = exclude_Sat_and_Sun(df)
omit_days = ['12-25', '01-01']
N = 3 # days near the holidays
df = exclude_days_near_omit_days(N, omit_days)
I was thinking of creating a new column to break out the month and day and then comparing them to the criteria for 1 and 2 above. However, I was hoping for something more Pythonic using the DateTimeIndex.
Thanks for any help.
Upvotes: 7
Views: 13804
Reputation: 40889
I followed the answer by @Bahman Engheta and created a function to omit dates from a dataframe.
import pandas as pd
from datetime import datetime, timedelta
def omit_dates(df, list_years, list_dates, omit_days_near=3, omit_weekends=False):
'''
Given a Pandas dataframe with a DatetimeIndex, remove rows that have a date
near a given list of dates and/or a date on a weekend.
Parameters:
----------
df : Pandas dataframe
list_years : list of str
Contains a list of years in string form
list_dates : list of str
Contains a list of dates in string form encoded as MM-DD
omit_days_near : int
Threshold of days away from list_dates to remove. For example, if
omit_days_near=3, then omit all days that are 3 days away from
any date in list_dates.
omit_weekends : bool
If true, omit dates that are on weekends.
Returns:
-------
Pandas dataframe
New resulting dataframe with dates omitted.
'''
if not isinstance(df, pd.core.frame.DataFrame):
raise ValueError("df is expected to be a Pandas dataframe, not %s" % type(df).__name__)
if not isinstance(df.index, pd.tseries.index.DatetimeIndex):
raise ValueError("Dataframe is expected to have an index of DateTimeIndex, not %s" %
type(df.index).__name__)
if not isinstance(list_years, list):
list_years = [list_years]
if not isinstance(list_dates, list):
list_dates = [list_dates]
result = df.copy()
if omit_weekends:
result = result.loc[result.index.dayofweek < 5]
omit_dates = [ '%s-%s' % (year, date) for year in list_years for date in list_dates ]
for date in omit_dates:
result = result.loc[abs(result.index.date - datetime.strptime(date, '%Y-%m-%d').date()) > timedelta(omit_days_near)]
return result
Here is example usage. Suppose you have a dataframe that has a DateTimeIndex and other columns, like this:
import pandas as pd
import numpy as np
range = pd.date_range('2017-12-01', '2018-01-05', freq='1D')
df = pd.DataFrame(index = range)
df['value'] = np.random.randint(low=0, high=60, size=len(df.index))
The resulting dataframe looks like this:
value
2017-12-01 42
2017-12-02 35
2017-12-03 49
2017-12-04 25
2017-12-05 19
2017-12-06 28
2017-12-07 21
2017-12-08 57
2017-12-09 3
2017-12-10 57
2017-12-11 46
2017-12-12 20
2017-12-13 7
2017-12-14 5
2017-12-15 30
2017-12-16 57
2017-12-17 4
2017-12-18 46
2017-12-19 32
2017-12-20 48
2017-12-21 55
2017-12-22 52
2017-12-23 45
2017-12-24 34
2017-12-25 42
2017-12-26 33
2017-12-27 17
2017-12-28 2
2017-12-29 2
2017-12-30 51
2017-12-31 19
2018-01-01 6
2018-01-02 43
2018-01-03 11
2018-01-04 45
2018-01-05 45
Now, let's specify dates to remove. I want to remove the dates '12-10', '12-25', '12-31', and '01-01' (following MM-DD notation) and all dates within 2 days of those dates. Further, I want to remove those dates from both the years '2016' and '2017'. I also want to remove weekend dates.
I'll call my function like this:
years = ['2016', '2017']
holiday_dates = ['12-10', '12-25', '12-31', '01-01']
omit_dates(df, years, holiday_dates, omit_days_near=2, omit_weekends=True)
The result is:
value
2017-12-01 42
2017-12-04 25
2017-12-05 19
2017-12-06 28
2017-12-07 21
2017-12-13 7
2017-12-14 5
2017-12-15 30
2017-12-18 46
2017-12-19 32
2017-12-20 48
2017-12-21 55
2017-12-22 52
2017-12-28 2
2018-01-03 11
2018-01-04 45
2018-01-05 45
Is that answer correct? Here are the calendars for December 2017 and January 2018:
December 2017
Su Mo Tu We Th Fr Sa
1 2
3 4 5 6 7 8 9
10 11 12 13 14 15 16
17 18 19 20 21 22 23
24 25 26 27 28 29 30
31
January 2018
Su Mo Tu We Th Fr Sa
1 2 3 4 5 6
7 8 9 10 11 12 13
14 15 16 17 18 19 20
21 22 23 24 25 26 27
28 29 30 31
Looks like it works.
Upvotes: 2
Reputation: 116
The first part can be easily accomplished using the Pandas DatetimeIndex.dayofweek
property, which starts counting weekdays with Monday as 0 and ending with Sunday as 6.
df[df.index.dayofweek < 5]
will give you only the weekdays.
For the second part you can use the datetime
module. Below I will give an example for only one date, namely 2017-12-25. You can easily generalize it to a list of dates, for example by defining a helper function.
from datetime import datetime, timedelta
N = 3
df[abs(df.index.date - datetime.strptime("2017-12-25", '%Y-%m-%d').date()) > timedelta(N)]
This will give all dates that are more than N=3
days away from 2017-12-25. That is, it will exclude an interval of 7 days from 2017-12-22 to 2017-12-28.
Lastly, you can combine the two criteria using the &
operator, as you probably know.
df[
(df.index.dayofweek < 5)
&
(abs(df.index.date - datetime.strptime("2017-12-25", '%Y-%m-%d').date()) > timedelta(N))
]
Upvotes: 7