Reputation: 241
My input looks like this:
Data Id
201505 A
201507 A
201509 A
200001 B
200001 C
200002 C
200005 C
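For reference, the sample can be rebuilt as a small DataFrame like this (note that the code below refers to columns named 'data', 'series_id' and 'symbol', so the names here would need to be adapted):
import pandas as pd
# sample input shown above
month = pd.DataFrame({'Data': [201505, 201507, 201509, 200001, 200001, 200002, 200005],
                      'Id':   ['A', 'A', 'A', 'B', 'C', 'C', 'C']})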
I am finding the date gaps with the code below, but it takes too long to run on large data.
# convert to datetimes
month['data'] = pd.to_datetime(month['data'], format='%Y%m')
# resample by start of months with asfreq
mdf = (month.set_index('data')
            .groupby(['series_id', 'symbol'])['series_id']
            .resample('MS').asfreq()
            .rename('val')
            .reset_index())
x = mdf['val'].notnull().rename('g')
# create index by cumulative sum for unique groups of consecutive NaNs
mdf.index = x.cumsum()
# filter only NaN rows and aggregate first, last and count
mdf = (mdf[~x.values]
         .groupby(['series_id', 'symbol', 'g'])['data']
         .agg(['first', 'last', 'size'])
         .reset_index(level=2, drop=True)
         .reset_index())
print(mdf)
Id first last size
0 A 2015-06-01 2015-06-01 1
1 A 2015-08-01 2015-08-01 1
2 B 2000-02-01 2000-02-01 1
3 C 2000-03-01 2000-04-01 2
How can I reduce the time complexity, or is there another way to find the date gaps?
Upvotes: 1
Views: 462
Reputation: 29635
Building a bit on @RushabhMehta's idea, you can use pd.DateOffset to create the output dataframe. Your input dataframe is called month, with columns 'data' and 'series_id', according to your code. Here is the idea:
month['data'] = pd.to_datetime(month['data'], format='%Y%m')
month = month.sort_values(['series_id', 'data'])
# create mdf with the columns you want
mdf = pd.DataFrame({'Id': month.series_id,
                    'first': month.data + pd.DateOffset(months=1),
                    'last': month.groupby('series_id').data.shift(-1) - pd.DateOffset(months=1)})
Note how the column 'last' is created: using groupby, shifting the values and subtracting a month with pd.DateOffset(months=1). Now select only the rows where the date in 'first' is on or before the one in 'last', and create the 'size' column like so:
mdf = mdf.loc[mdf['first'] <= mdf['last']]
mdf['size'] = (mdf['last'] - mdf['first']).astype('timedelta64[M]') + 1
mdf looks like:
first Id last size
0 2015-06-01 A 2015-06-01 1.0
1 2015-08-01 A 2015-08-01 1.0
3 2000-02-01 B 2000-02-01 1.0
6 2000-03-01 C 2000-04-01 2.0
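If your pandas version rejects the month-based timedelta cast above (recent releases only accept second-or-finer resolutions there), a sketch of an equivalent computation, assuming 'first' and 'last' are datetime64 columns as built above, counts months from the year and month components:
# same month count, computed from calendar components
mdf['size'] = ((mdf['last'].dt.year - mdf['first'].dt.year) * 12
               + (mdf['last'].dt.month - mdf['first'].dt.month) + 1)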
You just need to reorder the columns and reset_index if you want.
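For instance, to match the column order shown in the question (my assumption about the desired layout):
mdf = mdf[['Id', 'first', 'last', 'size']].reset_index(drop=True)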
Upvotes: 1
Reputation: 1559
The assumptions made are the following:
Here is my algorithm (mdf is the input df):
import pandas as pd
# keep 'Data' in df2 so the per-group filter below can refer to it
df2 = pd.DataFrame({'Id': mdf['Id'], 'Data': mdf['Data'], 'First': mdf['Data'] + 1,
                    'Last': (mdf['Data'] - 1).shift(-1)})
# drop the last row of each Id (its 'Last' was shifted in from the next group)
df2 = df2.groupby('Id').apply(lambda g: g[g['Data'] != g['Data'].max()]).reset_index(drop=True)
print(df2[~df2['First'].isin(mdf['Data']) & ~df2['Last'].isin(mdf['Data'])])
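For instance, with the sample from the question loaded as integer month codes (a hypothetical setup matching these assumptions):
mdf = pd.DataFrame({'Id':   ['A', 'A', 'A', 'B', 'C', 'C', 'C'],
                    'Data': [201505, 201507, 201509, 200001, 200001, 200002, 200005]})
Each printed row then describes one gap, with 'First' and 'Last' given as month codes. Because the arithmetic is plain integer addition, a gap that crosses a year boundary would produce bounds like 201513 that are not valid months.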
Upvotes: 1