ABY
ABY

Reputation: 393

python how to find the number of days in each month between two date columns

I have two date columns Start Date and End Date and I want to find the Year and number of days in each month between those two dates. I can find a year but no idea how to find the number of days in each month. Not sure if it is feasible to get this o/p.

from pandas import DataFrame
import re

df = {'Id': ['1','2','3','4','5'],
      'Item': ['A','B','C','D','E'],
        'StartDate': ['2019-12-10', '2019-12-01', '2019-01-01', '2019-05-10', '2019-03-10'],
        'EndDate': ['2019-12-30' ,'2019-12-31','2019-03-30','2019-11-30','2019-06-10']
        }
df = DataFrame(df,columns= ['Id', 'Item','StartDate','EndDate'])

Expected O/P: enter image description here

Upvotes: 0

Views: 1536

Answers (2)

CypherX
CypherX

Reputation: 7353

Solution

You can employ a combination of vectorization with pandas and numpy to achieve this as follows. The custom function is provided below, for ease of use. Since, it uses vectorization, it is supposed to be fairly fast.

Note: The assumption used here, based on the sample data:

  • The date range is only for one year
  • both start and end dates fall in the same year.

If you have data from different years, you would need to apply this to each year's data there. Also, if start and end dates fall on different years, you will have to adapt this method for that. Since the problem presented here, does not state that requirement, I leave this implementation as a guide for anyone interested in its application to a multiyear-spanned dataset.


If you would like to try out this solution in jupyter notebook environment, you can access it here on github. It has a Google Colaboratory link as well. So, you could also, directly open it in Google Colab Notebook.

# Updated DataFrame
df = process_dataframe(df) # custom function
display(df.head())

enter image description here

Dummy Data and Custom Function

Tested with pandas==0.25.3 and numpy==0.17.4 in Google Colab Environment.

import numpy as np
import pandas as pd
#from pandas.tseries.offsets import MonthEnd
from IPython.display import display

# Dummy Data
df = {'Id': ['1','2','3','4','5'],
      'Item': ['A','B','C','D','E'],
        'StartDate': ['2019-12-10', '2019-12-01', '2019-01-01', '2019-05-10', '2019-03-10'],
        'EndDate': ['2019-12-30' ,'2019-12-31','2019-03-30','2019-11-30','2019-06-10']
        }
df = pd.DataFrame(df,columns= ['Id', 'Item','StartDate','EndDate'])

# Function for Processing the DataFrame
def process_dataframe(df):
    """Returns the updated dataframe. """
    df.StartDate = pd.to_datetime(df.StartDate)
    df.EndDate = pd.to_datetime(df.EndDate)

    month_ends = pd.date_range(start='2019-01', freq='M', periods=12)    
    month_headers = month_ends.month_name().str.upper().str[:3].tolist()
    month_days = month_ends.day.to_numpy()
    month_nums = (np.arange(12) + 1)
    # Evaluate expressions to avoid multiple times evaluation
    start_date_month_num = df.StartDate.dt.month.to_numpy().reshape(-1,1)
    end_date_month_num = df.EndDate.dt.month.to_numpy().reshape(-1,1)

    #start_month_days = pd.to_datetime(df.StartDate, format="%Y%m") + MonthEnd(1) - df.StartDate
    # start_month_days.dt.days.to_numpy()
    # Number of days not in the end_month
    end_month_days_excluded = month_days[df.EndDate.dt.month.to_numpy() - 1] - df.EndDate.dt.day.to_numpy()

    # Determine the months that fall within the start and end dates (inclusive 
    # of start and end months) and then calculate the number of days in each 
    # month.

    # add all days for relevant months
    result = ((start_date_month_num <= month_nums) & \
            (end_date_month_num >= month_nums)).astype(int) \
            * month_days.reshape(1,-1) 
    # subtract number of days not in starting month
    result = result + \
            (-1) * (start_date_month_num == month_nums).astype(int) \
            * (df.StartDate.dt.day.to_numpy() - 1).reshape(-1,1) 
    # subtract number of days not in ending month            
    result = result + \
            (-1) * (end_date_month_num == month_nums).astype(int) \
            * end_month_days_excluded.reshape(-1,1) 

    return pd.merge(df, pd.DataFrame(result, columns = month_headers), left_index=True, right_index=True)

Original DataFrame:

# Original DataFrame
display(df.head())

Output:
enter image description here

Upvotes: 0

Andy L.
Andy L.

Reputation: 25259

I come up with solution using pd.date_range and resample. You need convert both columns StartDate and EndDate to datetime dtype

df['StartDate'] = pd.to_datetime(df['StartDate'])
df['EndDate'] = pd.to_datetime(df['EndDate'])

def days_of_month(x):
    s = pd.date_range(*x, freq='D').to_series()
    return s.resample('M').count().rename(lambda x: x.month)

df1 = df[['StartDate', 'EndDate']].apply(days_of_month, axis=1).fillna(0)

Out[1036]:
     1     2     3     4     5     6     7     8     9     10    11    12
0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0  21.0
1   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0  31.0
2  31.0  28.0  30.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0
3   0.0   0.0   0.0   0.0  22.0  30.0  31.0  31.0  30.0  31.0  30.0   0.0
4   0.0   0.0  22.0  30.0  31.0  10.0   0.0   0.0   0.0   0.0   0.0   0.0

Finally join back to original dataframe

df_final = df[['StartDate', 'EndDate']].join([df['StartDate'].dt.year.rename('Year'), df1])

Out[1042]:
   StartDate    EndDate  Year     1     2     3     4     5     6     7     8  \
0 2019-12-10 2019-12-30  2019   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0
1 2019-12-01 2019-12-31  2019   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0
2 2019-01-01 2019-03-30  2019  31.0  28.0  30.0   0.0   0.0   0.0   0.0   0.0
3 2019-05-10 2019-11-30  2019   0.0   0.0   0.0   0.0  22.0  30.0  31.0  31.0
4 2019-03-10 2019-06-10  2019   0.0   0.0  22.0  30.0  31.0  10.0   0.0   0.0

      9    10    11    12
0   0.0   0.0   0.0  21.0
1   0.0   0.0   0.0  31.0
2   0.0   0.0   0.0   0.0
3  30.0  31.0  30.0   0.0
4   0.0   0.0   0.0   0.0

Upvotes: 1

Related Questions