malte

Reputation: 559

Efficiently finding overlap between many date ranges

How can I efficiently find overlapping dates between many date ranges?

I have a pandas dataframe containing information on the daily warehouse stock of many products. There are only records for those dates where stock actually changed.

import pandas as pd
df = pd.DataFrame({'product': ['a', 'a', 'a', 'b', 'b', 'b'],
                   'stock': [10, 0, 10, 5, 0, 5],
                   'date': ['2016-01-01', '2016-01-05', '2016-01-15',
                            '2016-01-01', '2016-01-10', '2016-01-20']})
df['date'] = pd.to_datetime(df['date'])
Out[4]: 
        date product  stock
0 2016-01-01       a     10
1 2016-01-05       a      0
2 2016-01-15       a     10
3 2016-01-01       b      5
4 2016-01-10       b      0
5 2016-01-20       b      5

From this data I want to identify the number of days where stock of all products was 0. In the example this would be 5 days (from 2016-01-10 to 2016-01-14).

I initially tried resampling the data to create one record for every day and then comparing day by day. This works, but it creates a very large dataframe that I can hardly keep in memory, because my data contains many dates where stock does not change.
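
Roughly, that attempt looked like this (a sketch; the exact groupby/resample calls are an approximation of my real code, and days after a product's last record are simply left as NaN here):

# one row per product per day, then a day-by-day comparison
daily = (df.set_index('date')
           .groupby('product')['stock']
           .resample('D').ffill()
           .unstack('product'))
all_zero = (daily == 0).all(axis='columns')
print(all_zero.sum())  # number of days where every product is at 0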

Is there a more memory-efficient way to calculate overlaps other than creating a record for every date and comparing day by day?

Maybe I can somehow create a period representation for the time range implicit in every record and then compare all periods across products? Another option could be to first subset only those time periods where a product has zero stock (relatively few) and then apply the resampling only to that subset of the data. What other, more efficient ways are there?
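
For illustration, here is a rough sketch of that second idea (untested beyond the toy data above; treating a product's last record as having no open-ended range is an assumption):

zero_days_per_product = {}
for product, grp in df.sort_values('date').groupby('product'):
    grp = grp.set_index('date')
    # each record is valid until the day before the next record;
    # the last record gets no open-ended range here (an assumption)
    ends = grp.index[1:].append(pd.DatetimeIndex([grp.index[-1]]))
    days = set()
    for (start, stock), end in zip(grp['stock'].items(), ends):
        if stock == 0:
            days.update(pd.date_range(start, end - pd.Timedelta(days=1)))
    zero_days_per_product[product] = days

common = set.intersection(*zero_days_per_product.values())
print(len(common))  # 5 for the example data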

Upvotes: 0

Views: 1031

Answers (2)

Stop harming Monica

Reputation: 12618

You can pivot the table using the dates as index and the products as columns, then fill NaNs with the previous values, convert to daily frequency, and look for rows with 0's in all columns.

# pivot: dates as index, products as columns, stock as values;
# forward-fill missing values and expand to daily frequency
ptable = (df.pivot(index='date', columns='product', values='stock')
          .fillna(method='ffill').asfreq('D', method='ffill'))
# rows where every product's stock is 0
cond = ptable.apply(lambda x: (x == 0).all(), axis='columns')
print(ptable.index[cond])

DatetimeIndex(['2016-01-10', '2016-01-11', '2016-01-12', '2016-01-13',
               '2016-01-14'],
              dtype='datetime64[ns]', name=u'date', freq='D')
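
If only the count of days is needed, the boolean mask can be summed directly:

print(cond.sum())  # 5 zero-stock days in the example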

Upvotes: 1

Fanchi

Reputation: 787

Here, try this. I know it's not the prettiest code, but given the data provided here it should work:

from datetime import timedelta
import pandas as pd

df = pd.DataFrame({'product': ['a', 'a', 'a', 'b', 'b', 'b'],
                   'stock': [10, 0, 10, 5, 0, 5],
                   'date': ['2016-01-01', '2016-01-05', '2016-01-15',
                            '2016-01-01', '2016-01-10', '2016-01-20']})
df['date'] = pd.to_datetime(df['date'])
df = df.sort_values('date', ascending=True)
no_stock_dates = []   # list of (start, end) tuples where total stock was 0
product_stock = {}    # latest known stock per product
in_flag = False       # are we currently inside a zero-stock period?
begin = df['date'].iloc[0]
for index, row in df.iterrows():
    current = row['date']
    product_stock[row['product']] = row['stock']
    if current > begin:
        # total stock dropped to zero: a zero-stock period starts here
        if sum(product_stock.values()) == 0 and not in_flag:
            in_flag = True
            begin = row['date']
        # stock came back: close the period one day before the current record
        if sum(product_stock.values()) != 0 and in_flag:
            in_flag = False
            no_stock_dates.append((begin, current - timedelta(days=1)))

print(no_stock_dates)

This code should run in O(n*k) time, where n is the number of rows and k is the number of product categories.
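
If you want the total number of days rather than the ranges, the tuples in no_stock_dates (both endpoints inclusive) can be summed up like this:

total_days = sum((end - start).days + 1 for start, end in no_stock_dates)
print(total_days)  # 5 for the example data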

Upvotes: 0
