malte

Reputation: 559

Efficiently finding overlap between many date ranges

How can I efficiently find overlapping dates between many date ranges?

I have a pandas dataframe containing information on the daily warehouse stock of many products. There are only records for those dates where stock actually changed.

import pandas as pd
df = pd.DataFrame({'product': ['a', 'a', 'a', 'b', 'b', 'b'],
                   'stock': [10, 0, 10, 5, 0, 5],
                   'date': ['2016-01-01', '2016-01-05', '2016-01-15',
                            '2016-01-01', '2016-01-10', '2016-01-20']})
df['date'] = pd.to_datetime(df['date'])
Out[4]: 
        date product  stock
0 2016-01-01       a     10
1 2016-01-05       a      0
2 2016-01-15       a     10
3 2016-01-01       b      5
4 2016-01-10       b      0
5 2016-01-20       b      5

From this data I want to identify the number of days where stock of all products was 0. In the example this would be 5 days (from 2016-01-10 to 2016-01-14).

I initially tried resampling the data to create one record for every day and then comparing day by day. This works, but it creates a very large dataframe that I can hardly keep in memory, because my data contains many dates where stock does not change.
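
Roughly, that attempt looked like this (a sketch; the exact groupby/resample calls are an approximation of my real code, and days after a product's last record are simply left as NaN here):

# one row per product per day, then a day-by-day comparison
daily = (df.set_index('date')
           .groupby('product')['stock']
           .resample('D').ffill()
           .unstack('product'))
all_zero = (daily == 0).all(axis='columns')
print(all_zero.sum())  # number of days where every product is at 0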

Is there a more memory-efficient way to calculate overlaps other than creating a record for every date and comparing day by day?

Maybe I can somehow create a period representation for the time range implicit in every record and then compare all periods across products? Another option could be to first subset only those time periods where a product has zero stock (relatively few) and then apply the resampling only to that subset of the data. What other, more efficient ways are there?
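
For illustration, here is a rough sketch of that second idea (untested beyond the toy data above; treating a product's last record as having no open-ended range is an assumption):

zero_days_per_product = {}
for product, grp in df.sort_values('date').groupby('product'):
    grp = grp.set_index('date')
    # each record is valid until the day before the next record;
    # the last record gets no open-ended range here (an assumption)
    ends = grp.index[1:].append(pd.DatetimeIndex([grp.index[-1]]))
    days = set()
    for (start, stock), end in zip(grp['stock'].items(), ends):
        if stock == 0:
            days.update(pd.date_range(start, end - pd.Timedelta(days=1)))
    zero_days_per_product[product] = days

common = set.intersection(*zero_days_per_product.values())
print(len(common))  # 5 for the example data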

Upvotes: 0

Views: 1031

Answers (2)

Stop harming Monica

Reputation: 12618

You can pivot the table using the dates as index and the products as columns, then fill NaNs with the previous values, convert to daily frequency, and look for rows with 0's in all columns.

# pivot: dates as index, products as columns, stock as values;
# forward-fill missing values and expand to daily frequency
ptable = (df.pivot(index='date', columns='product', values='stock')
          .fillna(method='ffill').asfreq('D', method='ffill'))
# rows where every product's stock is 0
cond = ptable.apply(lambda x: (x == 0).all(), axis='columns')
print(ptable.index[cond])

DatetimeIndex(['2016-01-10', '2016-01-11', '2016-01-12', '2016-01-13',
               '2016-01-14'],
              dtype='datetime64[ns]', name=u'date', freq='D')
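
If only the count of days is needed, the boolean mask can be summed directly:

print(cond.sum())  # 5 zero-stock days in the example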

Upvotes: 1

Fanchi

Reputation: 787

Here, try this. I know it's not the prettiest code, but given the data provided here it should work:

from datetime import timedelta
import pandas as pd

df = pd.DataFrame({'product': ['a', 'a', 'a', 'b', 'b', 'b'],
                   'stock': [10, 0, 10, 5, 0, 5],
                   'date': ['2016-01-01', '2016-01-05', '2016-01-15',
                            '2016-01-01', '2016-01-10', '2016-01-20']})
df['date'] = pd.to_datetime(df['date'])
df = df.sort_values('date', ascending=True)
no_stock_dates = []   # list of (start, end) tuples where total stock was 0
product_stock = {}    # latest known stock per product
in_flag = False       # are we currently inside a zero-stock period?
begin = df['date'].iloc[0]
for index, row in df.iterrows():
    current = row['date']
    product_stock[row['product']] = row['stock']
    if current > begin:
        # total stock dropped to zero: a zero-stock period starts here
        if sum(product_stock.values()) == 0 and not in_flag:
            in_flag = True
            begin = row['date']
        # stock came back: close the period one day before the current record
        if sum(product_stock.values()) != 0 and in_flag:
            in_flag = False
            no_stock_dates.append((begin, current - timedelta(days=1)))

print(no_stock_dates)

This code should run in O(n*k) time, where n is the number of rows and k is the number of product categories.
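
If you want the total number of days rather than the ranges, the tuples in no_stock_dates (both endpoints inclusive) can be summed up like this:

total_days = sum((end - start).days + 1 for start, end in no_stock_dates)
print(total_days)  # 5 for the example data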

Upvotes: 0
