Reputation: 638
I have the following dataframe:
Input:-
ID month Name
A1 2017.01 A
A1 2017.02 B
A1 2017.04 C
A2 2017.02 A
A2 2017.03 D
A2 2017.05 C
Output:-
ID month Name
A1 2017.01 A
A1 2017.02 B
A1 2017.03 B
A1 2017.04 C
A2 2017.02 A
A2 2017.03 D
A2 2017.04 D
A2 2017.05 C
I require to get the missing months in the sequence and the value of the month preceding it and which is present in the input list. Consider example of ID "A1". "A1" has months 1,2,4 and has missing month 3. So i need to add the row with value "A1" as ID, month as "2017.03" and Name as "B". Please note the "Name" column should get its value from the row immediately above it that is present in the input.
How do I achieve this in pandas, or by any other method in python.
Any help is appreciated! Thanks
Upvotes: 1
Views: 2344
Reputation: 795
There was a question in the comments about how df knows which column to ffill, and I just decided to go through it and post it here, maybe someone finds it useful (or I use it as a reference for myself):
mytest = pd.DataFrame({'ID': ['A1', 'A1', 'A1', 'A2', 'A2', 'A2'], 'month': ['2017.01', '2017.02', '2017.04', '2017.02', '2017.03', '2017.05'], 'Name':['A','B','C','A','D','C']})
mytest.month = pd.to_datetime(mytest.month)
mytest=mytest.set_index('month').groupby(['ID'])
mytest = mytest.resample('MS').asfreq()['Name']
mytest = pd.DataFrame(pd.DataFrame(mytest).to_records())
mytest.Name = mytest.Name.ffill()
mytest
Obviously outputs a very similar thing, I just have not formatted months back to the original format.
ID month Name
0 A1 2017-01-01 A
1 A1 2017-02-01 B
2 A1 2017-03-01 B
3 A1 2017-04-01 C
4 A2 2017-02-01 A
5 A2 2017-03-01 D
6 A2 2017-04-01 D
7 A2 2017-05-01 C
Upvotes: 0
Reputation: 153540
Let's try this with @EFT's suggestion:
df['Date'] = pd.to_datetime(df.month.astype(str),format='%Y.%m')
df_out = df.set_index('Date').groupby('ID').resample('MS').asfreq().ffill().reset_index(level=0, drop=True)
df_out = df_out.reset_index()
df_out['month'] = df_out.Date.dt.strftime('%Y.%m')
df_out = df_out.drop('Date',axis=1)
print(df_out)
Output:
ID month Name
0 A1 2017.01 A
1 A1 2017.02 B
2 A1 2017.03 B
3 A1 2017.04 C
4 A2 2017.02 A
5 A2 2017.03 D
6 A2 2017.04 D
7 A2 2017.05 C
Upvotes: 5