Larry Flores
Larry Flores

Reputation: 73

How to get rid of duplicate values from each row in a column in a pandas dataframe

I have run into an issue where if there are duplicate dates in both date_x and date_y the output of the Holiday column seems to add an instance of the holiday for each appearance of the rows with the same exact dates. Below is the code I am using with a small sample set of data to illustrate my issue.

from pandas.tseries.holiday import USFederalHolidayCalendar
from datetime import datetime
import pandas as pd

cal = USFederalHolidayCalendar()
holidays = (pd.DataFrame(cal.holidays(return_name=True), columns=['Holiday'])
            .reset_index()
            .rename({'index': 'Date'}, axis=1))
holidays['Date'] = pd.to_datetime(holidays['Date'])
df= pd.DataFrame({'Date_x': {0: '2020-12-22', 1: '2020-06-20', 2: '2020-02-11', 3: '2020-05-22', 4: '2020-12-22', 5: '2020-12-20', 6: '2020-12-20', 7: '2020-12-22'},
                  'Date_y': {0: '2021-01-01', 1: '2020-07-11', 2: '2020-03-27', 3: '2020-06-27', 4: '2021-01-01', 5: '2020-12-26', 6: '2020-12-27', 7: '2021-01-01'}})
df['Date_x'] = pd.to_datetime(df['Date_x'])
df['Date_y'] = pd.to_datetime(df['Date_y'])

Y = 2000 # dummy leap year to allow input X-02-29 (leap day)
seasons = [('Winter', (date(Y,  1,  1),  date(Y,  3, 20))),
           ('Spring', (date(Y,  3, 21),  date(Y,  6, 20))),
           ('Summer', (date(Y,  6, 21),  date(Y,  9, 22))),
           ('Fall', (date(Y,  9, 23),  date(Y, 12, 20))),
           ('Winter', (date(Y, 12, 21),  date(Y, 12, 31)))]

def get_season(x):
    x = x.replace(year=Y)
    return next(season for season, (start, end) in seasons
                if start <= x <= end)


def get_holiday():
    return pd.DataFrame([(h,y,z) for (h,d) in zip(holidays['Holiday'], holidays['Date'])
     for (y, z) in zip(df['Date_x'], df['Date_y']) if y.date() <= d.date() if d.date() <= z.date()], columns=['Holiday', 'Date_x', 'Date_y'])


s1 = df['Date_x'].apply(lambda x: get_season(x))
s2 = df['Date_y'].apply(lambda x: get_season(x))
df['Season']= [', '.join(list(set([x,y]))) for (x,y) in zip(s1,s2)]
dft = get_holiday()
dft = dft.groupby(['Date_x', 'Date_y'])['Holiday'].apply(lambda x: ', '.join(list(x)))
df = pd.merge(df, dft, how='left', on=['Date_x', 'Date_y'])

So running the code above I get this output:

Incorrect output I don’t want all those duplicate Holiday values in each field, it gets much worse with my actual file, I’ll add that csv here as well, just in case.

Here is a google sheets link to the full set of datas that I am working with.

Upvotes: 0

Views: 44

Answers (2)

Joe Ferndz
Joe Ferndz

Reputation: 8508

@Larry, I just added a simple drop_duplicates() statement to remove them.

df = pd.merge(df, dft, how='left', on=['Date_x', 'Date_y'])
df = df.drop_duplicates()
print(df)

I added the drop_duplicates() right after your last statement

Upvotes: 0

Aviv Yaniv
Aviv Yaniv

Reputation: 6298

Change:

dft = dft.groupby(['Date_x', 'Date_y'])['Holiday'].apply(lambda x: ', '.join(list(x)))

To:

dft = dft.groupby(['Date_x', 'Date_y'])['Holiday'].apply(lambda x: ', '.join(set(x)))

Upvotes: 1

Related Questions