Calculating percentage of days covered in a range with a groupby Pandas function

Question

I have a simple Pandas dataframe where each row represents a person and a date range. For each person, I'd like to know what percentage of days in a hard-coded range (defined by variables period_start and period_end) are covered across the various entries in the dataframe.

I assume there is an easy way to do this with Pandas, but I haven't been able to find one. I have a solution with multiple dataframes and several nested loops, but it is inefficient at scale. How can I do this more effectively using Pandas? I think a groupby makes sense, but I'm not sure how to do that when the ranges span two columns and may overlap.

import pandas as pd
from datetime import datetime
df = pd.DataFrame(data=[['2016-01-01', '2016-01-31', 'A'],
                        ['2016-02-02', '2016-02-10', 'A'],
                        ['2016-03-01', '2016-04-01', 'A'],
                        ['2016-01-01', '2016-03-01', 'B']],
                  columns=['startdate', 'enddate', 'person'])

# start and end date
period_start = datetime(year=2016, month=1, day=1)
period_end = datetime(year=2016, month=12, day=31)

# calculate total days (inclusive of both endpoints)
total_days = (period_end - period_start).days + 1

# convert columns to dates
df['startdate']= pd.to_datetime(df['startdate'],  format='%Y-%m-%d')
df['enddate']= pd.to_datetime(df['enddate'],  format='%Y-%m-%d')

# create a date-indexed dataframe with one boolean column per person
rng = pd.date_range(period_start, periods=total_days, freq='D')
people = list(set(df['person'].tolist()))
dates_df = pd.DataFrame(False, index=rng, columns=people)

# loop over each date (index)
for index, row in dates_df.iterrows():

   # loop over each column (person)
   for person in people:
       tmp = df[df['person'] == person]

       # loop over each entry for the person
       for index1, row1 in tmp.iterrows():

           # check whether this date falls within the entry's range
           value = row1['startdate'] <= index <= row1['enddate']

           # if it's not already set to True, set it to True
           if value and not dates_df.loc[index, person]:
               dates_df.loc[index, person] = True

# for each person, show the percentage of days in range that are covered
for person in people:
    print(person, sum(dates_df[person].tolist()) / float(total_days))

Desired Output:

A 0.196721311475
B 0.166666666667

dylanjf · Accepted Answer

This should do it. Since you're adding 1 to the total days, I'm guessing you want the ranges to be inclusive, but edit it as needed :)

import pandas as pd
from datetime import datetime

df = pd.DataFrame(data=[['2016-01-01', '2016-01-31', 'A'],
                        ['2016-02-02', '2016-02-10', 'A'],
                        ['2016-03-01', '2016-04-01', 'A'],
                        ['2016-01-01', '2016-03-01', 'B']],
                  columns=['startdate', 'enddate', 'person'])

# start and end date
period_start = datetime(year=2016, month=1, day=1)
period_end = datetime(year=2016, month=12, day=31)

# convert columns to dates
df['startdate']= pd.to_datetime(df['startdate'],  format='%Y-%m-%d')
df['enddate']= pd.to_datetime(df['enddate'],  format='%Y-%m-%d')
df['days'] = df.apply(lambda x: max((min(x.enddate, period_end) - max(x.startdate, period_start)).days + 1, 0), axis=1)

#percentage of days in range by person
people_pct = df.groupby('person').apply(lambda x: x.days.sum() / ((period_end - period_start).days + 1))
print(people_pct.head())

-----------------
    person
    A    0.196721
    B    0.166667

You're on the right track - pandas groupby is great for segmenting data, but the real power comes from the .apply() function, which can run either a common aggregation (mean, std, etc.) or, as in this case, a custom function.

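The lambda within df.apply is saying "for each row/column (depending on the axis), do this custom function", and the results come back as a Series. The lambda within groupby('person').apply works the same way, except that it is called once per group with that group's rows.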
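For comparison, the same per-row clipping can probably be written with column operations instead of a row-wise lambda. This is only a sketch of an equivalent calculation on the example data, using Series.clip on the date columns and a built-in sum per group; the names clipped_start and clipped_end are made up for illustration, and it makes the same no-overlap assumption as the code above.

```python
import pandas as pd

df = pd.DataFrame(data=[['2016-01-01', '2016-01-31', 'A'],
                        ['2016-02-02', '2016-02-10', 'A'],
                        ['2016-03-01', '2016-04-01', 'A'],
                        ['2016-01-01', '2016-03-01', 'B']],
                  columns=['startdate', 'enddate', 'person'])
df['startdate'] = pd.to_datetime(df['startdate'], format='%Y-%m-%d')
df['enddate'] = pd.to_datetime(df['enddate'], format='%Y-%m-%d')

period_start = pd.Timestamp('2016-01-01')
period_end = pd.Timestamp('2016-12-31')
total_days = (period_end - period_start).days + 1

# clip each range to the period, then count inclusive days (never below 0)
clipped_start = df['startdate'].clip(lower=period_start)
clipped_end = df['enddate'].clip(upper=period_end)
df['days'] = ((clipped_end - clipped_start).dt.days + 1).clip(lower=0)

# built-in sum per person instead of a custom lambda
people_pct = df.groupby('person')['days'].sum() / total_days
print(people_pct)
```

On large frames this usually avoids the per-row Python call that apply(axis=1) makes, though the result should be the same.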

While this answers your question, it does not detect unique days: a day covered by two overlapping rows for the same person would be counted twice. So we're assuming the rows don't overlap, as in your example.
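If the rows for a person can overlap, one way to count each day only once is to collect the covered days into a set per person. A minimal sketch, assuming the same column names as above; the extra overlapping row for A and the helper name covered_fraction are made up for illustration, and expanding every range into individual days may be slow for very long ranges.

```python
import pandas as pd

df = pd.DataFrame(data=[['2016-01-01', '2016-01-31', 'A'],
                        ['2016-01-15', '2016-02-05', 'A'],  # hypothetical overlapping row
                        ['2016-02-02', '2016-02-10', 'A'],
                        ['2016-03-01', '2016-04-01', 'A'],
                        ['2016-01-01', '2016-03-01', 'B']],
                  columns=['startdate', 'enddate', 'person'])
df['startdate'] = pd.to_datetime(df['startdate'], format='%Y-%m-%d')
df['enddate'] = pd.to_datetime(df['enddate'], format='%Y-%m-%d')

period_start = pd.Timestamp('2016-01-01')
period_end = pd.Timestamp('2016-12-31')
total_days = (period_end - period_start).days + 1


def covered_fraction(group):
    """Fraction of the period covered by at least one of the group's rows."""
    covered = set()
    for start, end in zip(group['startdate'], group['enddate']):
        # date_range is empty when the clipped start falls after the clipped end
        covered.update(pd.date_range(max(start, period_start),
                                     min(end, period_end), freq='D'))
    return len(covered) / total_days


# select only the date columns so each group holds just the ranges
unique_pct = df.groupby('person')[['startdate', 'enddate']].apply(covered_fraction)
print(unique_pct)
```

Summing the per-row day counts would add all 22 days of the extra row to A's total; here it contributes only the one day (Feb 1) that no other row already covers.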
