anthr
anthr

Reputation: 1036

Counting dates in a range set by pandas dataframe

I have a pandas dataframe that contains two date columns, a start date and an end date that defines a range. I'd like to be able to collect a total count for all dates across all rows in the dataframe, as defined by these columns.

For example, the table looks like:

index        start_date         end date
     0         '2015-01-01'    '2015-01-17'
     1         '2015-01-03'    '2015-01-12'

And the result would be a per date aggregate, like:

    date           count
'2015-01-01'     1
'2015-01-02'     1
'2015-01-03'     2

and so on.

My current approach works but is extremely slow on a big dataframe as I'm looping across the rows, calculating the range and then looping through this. I'm hoping to find a better approach.

Currently I'm doing :

date = pd.date_range (min (df.start_date), max (df.end_date))
df2 = pd.DataFrame (index =date)
df2 ['count'] = 0

for index, row in df.iterrows ():
    dates = pd.date_range (row ['start_date'], row ['end_date'])
    for date in dates:
        df2.loc['date']['count'] += 1

Upvotes: 4

Views: 6781

Answers (3)

Alexander
Alexander

Reputation: 109528

After stacking the relevant columns as suggested by @Sam, just use value_counts.

df[['start_date', 'end date']].stack().value_counts()

EDIT:

Given that you also want to count the dates between the start and end dates:

start_dates = pd.to_datetime(df.start_date)
end_dates = pd.to_datetime(df.end_date)

>>> pd.Series(dt.date() for group in 
              [pd.date_range(start, end) for start, end in zip(start_dates, end_dates)]  
              for dt in group).value_counts()
Out[178]: 
2015-01-07    2
2015-01-06    2
2015-01-12    2
2015-01-05    2
2015-01-04    2
2015-01-10    2
2015-01-03    2
2015-01-09    2
2015-01-08    2
2015-01-11    2
2015-01-16    1
2015-01-17    1
2015-01-14    1
2015-01-15    1
2015-01-02    1
2015-01-01    1
2015-01-13    1
dtype: int64

Upvotes: 6

MaxU - stand with Ukraine
MaxU - stand with Ukraine

Reputation: 210832

I would use melt() method for that:

In [76]: df
Out[76]:
      start_date   end_date
index
0     2015-01-01 2015-01-17
1     2015-01-03 2015-01-12
2     2015-01-03 2015-01-17

In [77]: pd.melt(df, value_vars=['start_date','end_date']).groupby('value').size()
Out[77]:
value
2015-01-01    1
2015-01-03    2
2015-01-12    1
2015-01-17    2
dtype: int64

Upvotes: 2

Sam
Sam

Reputation: 4090

I think the solution here is to 'stack' your two date columns, group by the date,and do a count. Play around with the df.stack() function. Here is something i threw together that yields a good solution:

import datetime
df = pd.DataFrame({'Start' : [datetime.date(2016, 5, i) for i in range(1,30)],
                  'End':[datetime.date(2016, 5, i) for i in range(1,30)]})
df.stack().reset_index()[[0, 'level_1']].groupby(0).count()

Upvotes: 2

Related Questions