Reputation: 39
I have a dataframe with columns 'start_date' and 'end_date' for about 2,000 records. I want to create a new dataframe that counts all of the dates between the start and end date and creates a dataframe that sums these counts for each date as follows:
start and end dataframe
ID | start_date | end_date |
---|---|---|
1 | 01/01/2021 | 03/01/2021 |
2 | 02/01/2021 | 04/01/2021 |
3 | 01/01/2021 | 04/01/2021 |
date count dataframe
date | count |
---|---|
01/01/2021 | 2 |
02/01/2021 | 3 |
03/01/2021 | 3 |
04/01/2021 | 2 |
Upvotes: 2
Views: 749
Reputation: 23217
You can use:
# Convert dates in dd/mm/yyyy to datetime format
df['start_date'] = pd.to_datetime(df['start_date'], dayfirst=True)
df['end_date'] = pd.to_datetime(df['end_date'], dayfirst=True)
# Create date ranges for each row
date_rng = df.apply(lambda x: pd.date_range(x['start_date'], x['end_date']), axis=1)
# expand dates in range dates into separate rows and count unique dates
df_out = df.assign(date=date_rng).explode('date').groupby('date')['date'].count().reset_index(name='count')
Result:
print(df_out)
date count
0 2021-01-01 2
1 2021-01-02 3
2 2021-01-03 3
3 2021-01-04 2
Upvotes: 0
Reputation: 18416
You can use pd.date_range
to get individual dates between start_date
and end_date
for each of the rows, then explode it, and finally call value_counts
>>> out = df.apply(lambda x: pd.date_range(x['start_date'], x['end_date']),
axis=1).explode().value_counts()
If needed call to_frame()
passing the column name for counts and reset and rename the index column:
>>> out.to_frame('count').reset_index().rename(columns={'index':'date'})
OUTPUT:
date count
0 2021-01-03 3
1 2021-01-02 3
2 2021-01-04 2
3 2021-01-01 2
Don't forget to convert start_date
and end_date
columns to datetime
type if they are not:
>>> df['start_date'] = pd.to_datetime(df['start_date'], dayfirst=True)
>>> df['end_date'] = pd.to_datetime(df['end_date'], dayfirst=True)
Upvotes: 0
Reputation: 862641
If medium/large DataFrame for better performance is better avoid explode
with date_range
, better is use repeat
with adding timedeltas
:
df["start_date"] = pd.to_datetime(df["start_date"], dayfirst=True)
df["end_date"] = pd.to_datetime(df["end_date"], dayfirst=True)
#subtract values and convert to days
s = df["end_date"].sub(df["start_date"]).dt.days + 1
#repeat index
df = df.loc[df.index.repeat(s)].copy()
#add days by timedeltas
add = pd.to_timedelta(df.groupby(level=0).cumcount(), unit='d')
df1 = (df["start_date"].add(add)
.value_counts()
.sort_index()
.rename_axis('date')
.reset_index(name='count'))
print (df1)
date count
0 2021-01-01 2
1 2021-01-02 3
2 2021-01-03 3
3 2021-01-04 2
Upvotes: 3