pip
pip

Reputation: 39

Create a new dataframe counting of all the dates between two dates in another dataframe

I have a dataframe with columns 'start_date' and 'end_date' for about 2,000 records. I want to create a new dataframe that counts all of the dates between the start and end date and creates a dataframe that sums these counts for each date as follows:

start and end dataframe

ID start_date end_date
1 01/01/2021 03/01/2021
2 02/01/2021 04/01/2021
3 01/01/2021 04/01/2021

date count dataframe

date count
01/01/2021 2
02/01/2021 3
03/01/2021 3
04/01/2021 2

Upvotes: 2

Views: 749

Answers (3)

SeaBean
SeaBean

Reputation: 23217

You can use:

# Convert dates in dd/mm/yyyy to datetime format
df['start_date'] = pd.to_datetime(df['start_date'], dayfirst=True)
df['end_date'] = pd.to_datetime(df['end_date'], dayfirst=True)

# Create date ranges for each row
date_rng = df.apply(lambda x: pd.date_range(x['start_date'], x['end_date']), axis=1)

# expand dates in range dates into separate rows and count unique dates
df_out = df.assign(date=date_rng).explode('date').groupby('date')['date'].count().reset_index(name='count')

Result:

print(df_out)

        date  count
0 2021-01-01      2
1 2021-01-02      3
2 2021-01-03      3
3 2021-01-04      2

Upvotes: 0

ThePyGuy
ThePyGuy

Reputation: 18416

You can use pd.date_range to get individual dates between start_date and end_date for each of the rows, then explode it, and finally call value_counts

>>> out = df.apply(lambda x: pd.date_range(x['start_date'], x['end_date']),
                   axis=1).explode().value_counts()

If needed call to_frame() passing the column name for counts and reset and rename the index column:

>>> out.to_frame('count').reset_index().rename(columns={'index':'date'})

OUTPUT:

        date  count
0 2021-01-03      3
1 2021-01-02      3
2 2021-01-04      2
3 2021-01-01      2

Don't forget to convert start_date and end_date columns to datetime type if they are not:

>>> df['start_date'] = pd.to_datetime(df['start_date'], dayfirst=True)
>>> df['end_date'] = pd.to_datetime(df['end_date'], dayfirst=True)

Upvotes: 0

jezrael
jezrael

Reputation: 862641

If medium/large DataFrame for better performance is better avoid explode with date_range, better is use repeat with adding timedeltas:

df["start_date"] = pd.to_datetime(df["start_date"], dayfirst=True)
df["end_date"] = pd.to_datetime(df["end_date"], dayfirst=True)

#subtract values and convert to days
s = df["end_date"].sub(df["start_date"]).dt.days + 1

#repeat index
df = df.loc[df.index.repeat(s)].copy()

#add days by timedeltas
add = pd.to_timedelta(df.groupby(level=0).cumcount(), unit='d')
df1 = (df["start_date"].add(add)
                       .value_counts()
                       .sort_index()
                       .rename_axis('date')
                       .reset_index(name='count'))

print (df1)
        date  count
0 2021-01-01      2
1 2021-01-02      3
2 2021-01-03      3
3 2021-01-04      2

Upvotes: 3

Related Questions