Reputation: 313
I have to determine if there are gaps between date sets (determined by start and end date). I have two example dataframes:
import pandas as pd
a = pd.DataFrame({'start_date' : ['01-01-2014', '01-01-2015', '05-01-2016'],
'end_date' : ['01-01-2015', '01-01-2016', '05-01-2017']})
order = ['start_date', 'end_date']
a = a[order]
a.start_date = pd.to_datetime(a.start_date, dayfirst= True)
a.end_date = pd.to_datetime(a.end_date, dayfirst= True)
b = pd.DataFrame({'start_date' : ['01-01-2014', '01-01-2015', '05-01-2016',
'05-01-2017', '01-01-2015'],
'end_date' : ['01-01-2015', '01-01-2016', '05-01-2017',
'05-01-2018', '05-01-2018']})
order = ['start_date', 'end_date']
b = b[order]
b.start_date = pd.to_datetime(b.start_date, dayfirst= True)
b.end_date = pd.to_datetime(b.end_date, dayfirst= True)
a
b
For dataframe a
, the solution is simple enough. Order by start_date
, shift end_date
down by one and subtract the dates, if the difference is positive, there is a gap in the dates.
However, achieving this for dataframe b
is less obvious as there is a range that emcompases a larger range. I am unsure on a generic way of doing this that won't incorrectly find a gap. This will be done on grouped data (about 40000 groups).
Upvotes: 2
Views: 1753
Reputation: 210842
IIUC you can do something like this:
In [198]: (b.sort_values('start_date')
...: .stack()
...: .shift().diff().dt.days
...: .reset_index(name='days')
...: .dropna()
...: .query("level_1 == 'end_date' and days != 0"))
...:
Out[198]:
level_0 level_1 days
5 4 end_date -365.0
7 2 end_date -731.0
The following code should show us indices where gaps were found:
In [199]: (b.sort_values('start_date')
...: .stack()
...: .shift().diff().dt.days
...: .reset_index(name='days')
...: .dropna()
...: .query("level_1 == 'end_date' and days != 0")
...: .loc[:, 'level_0'])
...:
Out[199]:
5 4
7 2
Name: level_0, dtype: int64
Upvotes: 1
Reputation: 294258
This is the idea...
+1
for start dates and a -1
for end dates.numpy
to sort stuff and twist and turnreturn
a boolean mask of where the gaps start.def find_gaps(b):
d1 = b.values.ravel()
d2 = np.tile([1, -1], len(d1) // 2)
s = np.lexsort([-d2, d1])
u = np.empty_like(s)
r = np.arange(d1.size)
u[s] = r
return d2[s].cumsum()[u][1::2] == 0
demo
find_gaps(b)
array([False, False, False, False, True], dtype=bool)
find_gaps(a)
array([False, True, True], dtype=bool)
Upvotes: 1