raam

Reputation: 241

How to reduce the time complexity or improve the efficiency of a program finding month gaps using Python pandas

Input is like this

Data    Id
201505  A
201507  A
201509  A
200001  B
200001  C
200002  C
200005  C

I am finding date gaps using the code below, but it takes too long to complete for large data. How can I reduce its time complexity?

#convert to datetimes
month['data'] = pd.to_datetime(month['data'], format='%Y%m')
#resample by start of months with asfreq
mdf = (month.set_index('data')
            .groupby(['series_id','symbol'])['series_id']
            .resample('MS').asfreq()
            .rename('val').reset_index())
x = mdf['val'].notnull().rename('g')
#create index by cumulative sum for unique groups of consecutive NaNs
mdf.index = x.cumsum()
#filter only the NaN rows and aggregate first, last and count
mdf = (mdf[~x.values].groupby(['series_id','symbol','g'])['data']
       .agg(['first','last','size'])
       .reset_index(level=2, drop=True).reset_index())

print(mdf)

 Id      first       last  size
0  A 2015-06-01 2015-06-01     1
1  A 2015-08-01 2015-08-01     1
2  B 2000-02-01 2000-02-01     1
3  C 2000-03-01 2000-04-01     2
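As an aside, the cumulative-sum trick in the code above (giving each run of consecutive NaNs its own group id) can be illustrated in isolation on a small, made-up series:

```python
import pandas as pd
import numpy as np

s = pd.Series([1.0, np.nan, np.nan, 2.0, np.nan, 3.0])
notnull = s.notnull()
# The cumsum only increments on non-null rows, so every NaN inherits
# the id of the preceding non-null value; each NaN run shares one id.
group_id = notnull.cumsum()
print(group_id.tolist())  # [1, 1, 1, 2, 2, 3]
```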

How can I reduce the time complexity, or is there some other way to find the date gaps?

Upvotes: 1

Views: 462

Answers (2)

Ben.T

Reputation: 29635

So, borrowing a bit from @RushabhMehta's idea, you can use pd.DateOffset to build the output dataframe. Your input dataframe is called month, with columns 'data' and 'series_id', according to your code. Here is the idea:

month['data'] = pd.to_datetime(month['data'], format='%Y%m')
month = month.sort_values(['series_id','data'])
# create mdf with the column you want
mdf = pd.DataFrame({'Id':month.series_id, 'first':month.data + pd.DateOffset(months=1), 
                    'last': (month.groupby('series_id').data.shift(-1) - pd.DateOffset(months=1))})

Note how the column 'last' is created: group by 'series_id', shift the values, and subtract a month with pd.DateOffset(months=1). Now select only the rows where the date in 'first' is on or before the one in 'last', and create the column size such as:

mdf = mdf.loc[mdf['first'] <= mdf['last']]
mdf['size'] = (mdf['last']- mdf['first']).astype('timedelta64[M]')+1

mdf looks like:

       first Id       last  size
0 2015-06-01  A 2015-06-01   1.0
1 2015-08-01  A 2015-08-01   1.0
3 2000-02-01  B 2000-02-01   1.0
6 2000-03-01  C 2000-04-01   2.0

Just reorder the columns and reset_index if you want.
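For reference, a self-contained sketch of this approach on the sample input from the question (column names 'data' and 'series_id' assumed from the code above; size is computed from year/month components instead of .astype('timedelta64[M]'), which recent pandas versions reject):

```python
import pandas as pd

# Sample input from the question, under the assumed column names
month = pd.DataFrame({'data': [201505, 201507, 201509, 200001, 200001, 200002, 200005],
                      'series_id': ['A', 'A', 'A', 'B', 'C', 'C', 'C']})

month['data'] = pd.to_datetime(month['data'], format='%Y%m')
month = month.sort_values(['series_id', 'data'])

# Gap candidates: the month after each row, up to the month before the next row
mdf = pd.DataFrame({'Id': month.series_id,
                    'first': month.data + pd.DateOffset(months=1),
                    'last': month.groupby('series_id').data.shift(-1) - pd.DateOffset(months=1)})

# Keep only real gaps (NaT in 'last' compares False, dropping group tails)
mdf = mdf.loc[mdf['first'] <= mdf['last']]
# Month count from year/month components
mdf['size'] = ((mdf['last'].dt.year - mdf['first'].dt.year) * 12
               + (mdf['last'].dt.month - mdf['first'].dt.month) + 1)
mdf = mdf[['Id', 'first', 'last', 'size']].reset_index(drop=True)
print(mdf)
```

Note that with this sample, B contributes no row: it has only a single month of data, so there is no gap to detect.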

Upvotes: 1

Rushabh Mehta

Reputation: 1559

The assumptions made are the following:

  • All values in the Data column are unique, even across groups
  • The values in the Data column are integers
  • The data is sorted by group first and then by value.

Here is my algorithm (mdf is the input df):

import pandas as pd
# keep 'Data' in df2 so the last row of each Id group can be dropped below
df2 = pd.DataFrame({'Id': mdf['Id'], 'Data': mdf['Data'],
                    'First': mdf['Data'] + 1, 'Last': (mdf['Data'] - 1).shift(-1)})
df2 = df2.groupby('Id').apply(lambda g: g[g['Data'] != g['Data'].max()]).reset_index(drop=True)
print(df2[~df2['First'].isin(mdf['Data']) & ~df2['Last'].isin(mdf['Data'])])
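Under these assumptions the idea can be run end-to-end; here is a sketch on a single group (the C rows from the sample, which stay within one year, since the integer month arithmetic breaks across a year boundary), with the column references adjusted so it executes:

```python
import pandas as pd

# Single-group sample: the C rows from the question, all within one year
mdf = pd.DataFrame({'Data': [200001, 200002, 200005], 'Id': ['C', 'C', 'C']})

# Gap candidates as integers: First = month after a row, Last = month before the next
df2 = pd.DataFrame({'Id': mdf['Id'], 'Data': mdf['Data'],
                    'First': mdf['Data'] + 1, 'Last': (mdf['Data'] - 1).shift(-1)})
# Drop the last row of each Id group: its 'Last' is undefined (NaN after the shift)
df2 = df2.groupby('Id').apply(lambda g: g[g['Data'] != g['Data'].max()]).reset_index(drop=True)
# Keep only rows whose bounds are not existing months, i.e. real gaps
out = df2[~df2['First'].isin(mdf['Data']) & ~df2['Last'].isin(mdf['Data'])]
print(out)
```

This yields a single gap for C, from 200003 to 200004 ('Last' comes out as a float because of the shift-introduced NaN).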

Upvotes: 1
