Fungie
Fungie

Reputation: 313

Determining if there is a gap in dates using pandas

I have to determine if there are gaps between date sets (determined by start and end date). I have two example dataframes:

import pandas as pd

a = pd.DataFrame({'start_date' : ['01-01-2014', '01-01-2015', '05-01-2016'],
             'end_date' : ['01-01-2015', '01-01-2016', '05-01-2017']})

order = ['start_date', 'end_date']

a = a[order]

a.start_date = pd.to_datetime(a.start_date, dayfirst= True)
a.end_date = pd.to_datetime(a.end_date, dayfirst= True)


b = pd.DataFrame({'start_date' : ['01-01-2014', '01-01-2015', '05-01-2016', 
'05-01-2017', '01-01-2015'],
             'end_date' : ['01-01-2015', '01-01-2016', '05-01-2017',
                          '05-01-2018', '05-01-2018']})

order = ['start_date', 'end_date']

b = b[order]

b.start_date = pd.to_datetime(b.start_date, dayfirst= True)
b.end_date = pd.to_datetime(b.end_date, dayfirst= True)

a
b

For dataframe a, the solution is simple enough. Order by start_date, shift end_date down by one and subtract the dates, if the difference is positive, there is a gap in the dates.

However, achieving this for dataframe b is less obvious as there is a range that emcompases a larger range. I am unsure on a generic way of doing this that won't incorrectly find a gap. This will be done on grouped data (about 40000 groups).

Upvotes: 2

Views: 1753

Answers (2)

MaxU - stand with Ukraine
MaxU - stand with Ukraine

Reputation: 210842

IIUC you can do something like this:

In [198]: (b.sort_values('start_date')
     ...:   .stack()
     ...:   .shift().diff().dt.days
     ...:   .reset_index(name='days')
     ...:   .dropna()
     ...:   .query("level_1 == 'end_date' and days != 0"))
     ...:
Out[198]:
   level_0   level_1   days
5        4  end_date -365.0
7        2  end_date -731.0

The following code should show us indices where gaps were found:

In [199]: (b.sort_values('start_date')
     ...:   .stack()
     ...:   .shift().diff().dt.days
     ...:   .reset_index(name='days')
     ...:   .dropna()
     ...:   .query("level_1 == 'end_date' and days != 0")
     ...:   .loc[:, 'level_0'])
     ...:
Out[199]:
5    4
7    2
Name: level_0, dtype: int64

Upvotes: 1

piRSquared
piRSquared

Reputation: 294258

This is the idea...

  • Assign a +1 for start dates and a -1 for end dates.
  • Take a cumulative sum where I order by all dates as one flat array.
  • When cumulative sum is zero... we hit a gap.
  • Date values are the first priority, followed by being a start_date. This way, we don't add a negative one before adding a positive one when the end_date of one row equals the start date of the next row.
  • I use numpy to sort stuff and twist and turn
  • return a boolean mask of where the gaps start.

def find_gaps(b):
    d1 = b.values.ravel()
    d2 = np.tile([1, -1], len(d1) // 2)
    s = np.lexsort([-d2, d1])
    u = np.empty_like(s)
    r = np.arange(d1.size)
    u[s] = r
    return d2[s].cumsum()[u][1::2] == 0

demo

find_gaps(b)

array([False, False, False, False,  True], dtype=bool)

find_gaps(a)

array([False,  True,  True], dtype=bool)

Upvotes: 1

Related Questions