Reputation: 73
I have run into an issue where if there are duplicate dates in both date_x
and date_y
the output of the Holiday column seems to add an instance of the holiday for each appearance of the rows with the same exact dates. Below is the code I am using with a small sample set of data to illustrate my issue.
from pandas.tseries.holiday import USFederalHolidayCalendar
from datetime import datetime
import pandas as pd
cal = USFederalHolidayCalendar()
holidays = (pd.DataFrame(cal.holidays(return_name=True), columns=['Holiday'])
.reset_index()
.rename({'index': 'Date'}, axis=1))
holidays['Date'] = pd.to_datetime(holidays['Date'])
df= pd.DataFrame({'Date_x': {0: '2020-12-22', 1: '2020-06-20', 2: '2020-02-11', 3: '2020-05-22', 4: '2020-12-22', 5: '2020-12-20', 6: '2020-12-20', 7: '2020-12-22'},
'Date_y': {0: '2021-01-01', 1: '2020-07-11', 2: '2020-03-27', 3: '2020-06-27', 4: '2021-01-01', 5: '2020-12-26', 6: '2020-12-27', 7: '2021-01-01'}})
df['Date_x'] = pd.to_datetime(df['Date_x'])
df['Date_y'] = pd.to_datetime(df['Date_y'])
Y = 2000 # dummy leap year to allow input X-02-29 (leap day)
seasons = [('Winter', (date(Y, 1, 1), date(Y, 3, 20))),
('Spring', (date(Y, 3, 21), date(Y, 6, 20))),
('Summer', (date(Y, 6, 21), date(Y, 9, 22))),
('Fall', (date(Y, 9, 23), date(Y, 12, 20))),
('Winter', (date(Y, 12, 21), date(Y, 12, 31)))]
def get_season(x):
x = x.replace(year=Y)
return next(season for season, (start, end) in seasons
if start <= x <= end)
def get_holiday():
return pd.DataFrame([(h,y,z) for (h,d) in zip(holidays['Holiday'], holidays['Date'])
for (y, z) in zip(df['Date_x'], df['Date_y']) if y.date() <= d.date() if d.date() <= z.date()], columns=['Holiday', 'Date_x', 'Date_y'])
s1 = df['Date_x'].apply(lambda x: get_season(x))
s2 = df['Date_y'].apply(lambda x: get_season(x))
df['Season']= [', '.join(list(set([x,y]))) for (x,y) in zip(s1,s2)]
dft = get_holiday()
dft = dft.groupby(['Date_x', 'Date_y'])['Holiday'].apply(lambda x: ', '.join(list(x)))
df = pd.merge(df, dft, how='left', on=['Date_x', 'Date_y'])
So running the code above I get this output:
I don’t want all those duplicate Holiday values in each field, it gets much worse with my actual file, I’ll add that csv here as well, just in case.
Here is a google sheets link to the full set of datas that I am working with.
Upvotes: 0
Views: 44
Reputation: 8508
@Larry, I just added a simple drop_duplicates() statement to remove them.
df = pd.merge(df, dft, how='left', on=['Date_x', 'Date_y'])
df = df.drop_duplicates()
print(df)
I added the drop_duplicates() right after your last statement
Upvotes: 0
Reputation: 6298
Change:
dft = dft.groupby(['Date_x', 'Date_y'])['Holiday'].apply(lambda x: ', '.join(list(x)))
To:
dft = dft.groupby(['Date_x', 'Date_y'])['Holiday'].apply(lambda x: ', '.join(set(x)))
Upvotes: 1