Reputation: 241
My input looks like this:
Data Id
201505 A
201507 A
201509 A
200001 B
200001 C
200002 C
200005 C
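For reference, the sample can be rebuilt as a small DataFrame like this (note that the code below refers to columns named 'data', 'series_id' and 'symbol', so the names here would need to be adapted):
import pandas as pd
# sample input shown above
month = pd.DataFrame({'Data': [201505, 201507, 201509, 200001, 200001, 200002, 200005],
                      'Id':   ['A', 'A', 'A', 'B', 'C', 'C', 'C']})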
I am finding the date gaps with the code below, but it takes too long to run on large data.
# convert to datetimes
month['data'] = pd.to_datetime(month['data'], format='%Y%m')
# resample by start of months with asfreq
mdf = (month.set_index('data')
            .groupby(['series_id', 'symbol'])['series_id']
            .resample('MS').asfreq()
            .rename('val')
            .reset_index())
x = mdf['val'].notnull().rename('g')
# create index by cumulative sum for unique groups of consecutive NaNs
mdf.index = x.cumsum()
# filter only NaN rows and aggregate first, last and count
mdf = (mdf[~x.values]
         .groupby(['series_id', 'symbol', 'g'])['data']
         .agg(['first', 'last', 'size'])
         .reset_index(level=2, drop=True)
         .reset_index())
print(mdf)
Id first last size
0 A 2015-06-01 2015-06-01 1
1 A 2015-08-01 2015-08-01 1
2 B 2000-02-01 2000-02-01 1
3 C 2000-03-01 2000-04-01 2
How can I reduce the time complexity, or is there another way to find the date gaps?
Upvotes: 1
Views: 462
Reputation: 29635
Building a bit on @RushabhMehta's idea, you can use pd.DateOffset to create the output dataframe. Your input dataframe is called month, with columns 'data' and 'series_id', according to your code. Here is the idea:
month['data'] = pd.to_datetime(month['data'], format='%Y%m')
month = month.sort_values(['series_id', 'data'])
# create mdf with the columns you want
mdf = pd.DataFrame({'Id': month.series_id,
                    'first': month.data + pd.DateOffset(months=1),
                    'last': month.groupby('series_id').data.shift(-1) - pd.DateOffset(months=1)})
Note how the column 'last' is created: using groupby, shifting the values and subtracting a month with pd.DateOffset(months=1). Now select only the rows where the date in 'first' is on or before the one in 'last', and create the 'size' column like so:
mdf = mdf.loc[mdf['first'] <= mdf['last']]
mdf['size'] = (mdf['last'] - mdf['first']).astype('timedelta64[M]') + 1
mdf looks like:
first Id last size
0 2015-06-01 A 2015-06-01 1.0
1 2015-08-01 A 2015-08-01 1.0
3 2000-02-01 B 2000-02-01 1.0
6 2000-03-01 C 2000-04-01 2.0
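If your pandas version rejects the month-based timedelta cast above (recent releases only accept second-or-finer resolutions there), a sketch of an equivalent computation, assuming 'first' and 'last' are datetime64 columns as built above, counts months from the year and month components:
# same month count, computed from calendar components
mdf['size'] = ((mdf['last'].dt.year - mdf['first'].dt.year) * 12
               + (mdf['last'].dt.month - mdf['first'].dt.month) + 1)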
You just need to reorder the columns and reset_index if you want.
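For instance, to match the column order shown in the question (my assumption about the desired layout):
mdf = mdf[['Id', 'first', 'last', 'size']].reset_index(drop=True)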
Upvotes: 1
Reputation: 1559
The assumptions made are the following:
Here is my algorithm (mdf is the input df):
import pandas as pd
# keep 'Data' in df2 so the per-group filter below can refer to it
df2 = pd.DataFrame({'Id': mdf['Id'], 'Data': mdf['Data'], 'First': mdf['Data'] + 1,
                    'Last': (mdf['Data'] - 1).shift(-1)})
# drop the last row of each Id (its 'Last' was shifted in from the next group)
df2 = df2.groupby('Id').apply(lambda g: g[g['Data'] != g['Data'].max()]).reset_index(drop=True)
print(df2[~df2['First'].isin(mdf['Data']) & ~df2['Last'].isin(mdf['Data'])])
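For instance, with the sample from the question loaded as integer month codes (a hypothetical setup matching these assumptions):
mdf = pd.DataFrame({'Id':   ['A', 'A', 'A', 'B', 'C', 'C', 'C'],
                    'Data': [201505, 201507, 201509, 200001, 200001, 200002, 200005]})
Each printed row then describes one gap, with 'First' and 'Last' given as month codes. Because the arithmetic is plain integer addition, a gap that crosses a year boundary would produce bounds like 201513 that are not valid months.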
Upvotes: 1