Reputation: 393
I have two date columns Start Date and End Date and I want to find the Year and number of days in each month between those two dates. I can find a year but no idea how to find the number of days in each month. Not sure if it is feasible to get this o/p.
from pandas import DataFrame
import re
df = {'Id': ['1','2','3','4','5'],
'Item': ['A','B','C','D','E'],
'StartDate': ['2019-12-10', '2019-12-01', '2019-01-01', '2019-05-10', '2019-03-10'],
'EndDate': ['2019-12-30' ,'2019-12-31','2019-03-30','2019-11-30','2019-06-10']
}
df = DataFrame(df,columns= ['Id', 'Item','StartDate','EndDate'])
Upvotes: 0
Views: 1536
Reputation: 7353
You can employ a combination of vectorization with pandas and numpy to achieve this as follows. The custom function is provided below, for ease of use. Since, it uses vectorization, it is supposed to be fairly fast.
Note: The assumption used here, based on the sample data:
If you have data from different years, you would need to apply this to each year's data there. Also, if start and end dates fall on different years, you will have to adapt this method for that. Since the problem presented here, does not state that requirement, I leave this implementation as a guide for anyone interested in its application to a multiyear-spanned dataset.
If you would like to try out this solution in jupyter notebook environment, you can access it here on github. It has a Google Colaboratory link as well. So, you could also, directly open it in Google Colab Notebook.
# Updated DataFrame
df = process_dataframe(df) # custom function
display(df.head())
Tested with pandas==0.25.3
and numpy==0.17.4
in Google Colab Environment.
import numpy as np
import pandas as pd
#from pandas.tseries.offsets import MonthEnd
from IPython.display import display
# Dummy Data
df = {'Id': ['1','2','3','4','5'],
'Item': ['A','B','C','D','E'],
'StartDate': ['2019-12-10', '2019-12-01', '2019-01-01', '2019-05-10', '2019-03-10'],
'EndDate': ['2019-12-30' ,'2019-12-31','2019-03-30','2019-11-30','2019-06-10']
}
df = pd.DataFrame(df,columns= ['Id', 'Item','StartDate','EndDate'])
# Function for Processing the DataFrame
def process_dataframe(df):
"""Returns the updated dataframe. """
df.StartDate = pd.to_datetime(df.StartDate)
df.EndDate = pd.to_datetime(df.EndDate)
month_ends = pd.date_range(start='2019-01', freq='M', periods=12)
month_headers = month_ends.month_name().str.upper().str[:3].tolist()
month_days = month_ends.day.to_numpy()
month_nums = (np.arange(12) + 1)
# Evaluate expressions to avoid multiple times evaluation
start_date_month_num = df.StartDate.dt.month.to_numpy().reshape(-1,1)
end_date_month_num = df.EndDate.dt.month.to_numpy().reshape(-1,1)
#start_month_days = pd.to_datetime(df.StartDate, format="%Y%m") + MonthEnd(1) - df.StartDate
# start_month_days.dt.days.to_numpy()
# Number of days not in the end_month
end_month_days_excluded = month_days[df.EndDate.dt.month.to_numpy() - 1] - df.EndDate.dt.day.to_numpy()
# Determine the months that fall within the start and end dates (inclusive
# of start and end months) and then calculate the number of days in each
# month.
# add all days for relevant months
result = ((start_date_month_num <= month_nums) & \
(end_date_month_num >= month_nums)).astype(int) \
* month_days.reshape(1,-1)
# subtract number of days not in starting month
result = result + \
(-1) * (start_date_month_num == month_nums).astype(int) \
* (df.StartDate.dt.day.to_numpy() - 1).reshape(-1,1)
# subtract number of days not in ending month
result = result + \
(-1) * (end_date_month_num == month_nums).astype(int) \
* end_month_days_excluded.reshape(-1,1)
return pd.merge(df, pd.DataFrame(result, columns = month_headers), left_index=True, right_index=True)
Original DataFrame:
# Original DataFrame
display(df.head())
Upvotes: 0
Reputation: 25259
I come up with solution using pd.date_range
and resample
. You need convert both columns StartDate
and EndDate
to datetime dtype
df['StartDate'] = pd.to_datetime(df['StartDate'])
df['EndDate'] = pd.to_datetime(df['EndDate'])
def days_of_month(x):
s = pd.date_range(*x, freq='D').to_series()
return s.resample('M').count().rename(lambda x: x.month)
df1 = df[['StartDate', 'EndDate']].apply(days_of_month, axis=1).fillna(0)
Out[1036]:
1 2 3 4 5 6 7 8 9 10 11 12
0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 21.0
1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 31.0
2 31.0 28.0 30.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0 22.0 30.0 31.0 31.0 30.0 31.0 30.0 0.0
4 0.0 0.0 22.0 30.0 31.0 10.0 0.0 0.0 0.0 0.0 0.0 0.0
Finally join back to original dataframe
df_final = df[['StartDate', 'EndDate']].join([df['StartDate'].dt.year.rename('Year'), df1])
Out[1042]:
StartDate EndDate Year 1 2 3 4 5 6 7 8 \
0 2019-12-10 2019-12-30 2019 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 2019-12-01 2019-12-31 2019 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 2019-01-01 2019-03-30 2019 31.0 28.0 30.0 0.0 0.0 0.0 0.0 0.0
3 2019-05-10 2019-11-30 2019 0.0 0.0 0.0 0.0 22.0 30.0 31.0 31.0
4 2019-03-10 2019-06-10 2019 0.0 0.0 22.0 30.0 31.0 10.0 0.0 0.0
9 10 11 12
0 0.0 0.0 0.0 21.0
1 0.0 0.0 0.0 31.0
2 0.0 0.0 0.0 0.0
3 30.0 31.0 30.0 0.0
4 0.0 0.0 0.0 0.0
Upvotes: 1