Reputation: 2904
I have a dataframe like the one below:
name,date
AAA,201705
AAA,201706
AAA,201707
AAA,201708
AAA,201710
AAA,201711
AAA,201802
AAA,201803
AAA,201804
AAA,201805
AAA,201806
AAA,201807
This dataframe has two columns, name and date. The date column holds only the year and month, in yyyymm format.
The months 201709, 201712 and 201801 are missing from the date column.
I need to check whether all months are present. If any months are missing, I need the output in the format below, with one row per run of consecutive missing months:
name,start_date,end_date,count
AAA,201709,201709,1
AAA,201712,201801,2
I am trying to use the pandas diff function.
Upvotes: 2
Views: 256
Reputation: 862591
Use asfreq:
import pandas as pd

# convert the yyyymm column to datetimes (first day of each month)
df['date'] = pd.to_datetime(df['date'], format='%Y%m')
# reindex each name to month-start frequency; missing months become NaN
a = df.set_index('date').groupby('name')['name'].apply(lambda x: x.asfreq('MS'))
# keep only the NaN rows; consecutive NaNs share the same group key g
b = a.notnull().cumsum()[a.isnull()].reset_index(name='g')
# aggregate first date, last date and row count per gap
d = {'date': ['first', 'last'], 'name': ['first', 'size']}
df = b.groupby('g').agg(d).reset_index(drop=True)
# flatten the MultiIndex columns and rename them
df.columns = df.columns.map('_'.join)
df = df.rename(columns={'date_first': 'start_date',
                        'date_last': 'end_date',
                        'name_first': 'name',
                        'name_size': 'count'})
print(df)
  start_date   end_date name  count
0 2017-09-01 2017-09-01  AAA      1
1 2017-12-01 2018-01-01  AAA      2
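The question asks for yyyymm values and the column order name,start_date,end_date,count, while the result above holds full datetimes. The following is a minimal, self-contained sketch of one way to get there. It rebuilds the sample data from the question (assuming the dates are stored as integers), and it uses named aggregation instead of the dict-plus-rename step, which is a variation rather than the code above. The variable names `gaps` and `result` are illustrative only.

```python
import pandas as pd

# Sample data from the question; integer yyyymm values are an assumption.
df = pd.DataFrame({
    'name': ['AAA'] * 12,
    'date': [201705, 201706, 201707, 201708, 201710, 201711,
             201802, 201803, 201804, 201805, 201806, 201807],
})

# Parse yyyymm as the first day of the month.
df['date'] = pd.to_datetime(df['date'].astype(str), format='%Y%m')

# Per name, reindex to month-start frequency; missing months become NaN.
a = df.set_index('date').groupby('name')['name'].apply(lambda x: x.asfreq('MS'))

# Running count of present rows, kept only on missing rows:
# consecutive missing months end up with the same key g.
gaps = a.notnull().cumsum()[a.isnull()].reset_index(name='g')

# One row per gap, already in the requested column order.
result = (gaps.groupby('g')
              .agg(name=('name', 'first'),
                   start_date=('date', 'first'),
                   end_date=('date', 'last'),
                   count=('date', 'count'))
              .reset_index(drop=True))

# Convert the datetimes back to yyyymm strings.
for col in ['start_date', 'end_date']:
    result[col] = result[col].dt.strftime('%Y%m')

print(result.to_csv(index=False))
```

With the sample data this should give two rows: AAA,201709,201709,1 and AAA,201712,201801,2.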
Detail:
print(a)
name  date
AAA   2017-05-01    AAA
      2017-06-01    AAA
      2017-07-01    AAA
      2017-08-01    AAA
      2017-09-01    NaN
      2017-10-01    AAA
      2017-11-01    AAA
      2017-12-01    NaN
      2018-01-01    NaN
      2018-02-01    AAA
      2018-03-01    AAA
      2018-04-01    AAA
      2018-05-01    AAA
      2018-06-01    AAA
      2018-07-01    AAA
Name: name, dtype: object
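The grouping step relies on the fact that a running count of non-null rows does not increase on NaN rows, so every run of consecutive NaNs carries the same number. A small sketch on a toy Series (the values and the names `s`, `key`, `gaps`, `sizes` are made up for illustration) shows the idea:

```python
import pandas as pd

# Toy series: None marks a missing entry.
s = pd.Series(['x', 'x', None, 'x', None, None, 'x'])

# Running count of present values; it stays flat across missing entries.
key = s.notnull().cumsum()

# Keep the key only where the value is missing.
gaps = key[s.isnull()]

# Rows sharing a key belong to the same run of consecutive missing values.
sizes = gaps.groupby(gaps).size()

print(key.tolist())    # [1, 2, 2, 3, 3, 3, 4]
print(gaps.tolist())   # [2, 3, 3]
print(sizes.tolist())  # [1, 2]
```

Here the single missing entry gets key 2 and the two adjacent missing entries both get key 3, which is why grouping by that key yields one group per gap.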
Upvotes: 3