howard roark
howard roark

Reputation: 638

Insert Missing Months rows in the dataframe in python

I have the following dataframe:

Input:-

ID month   Name
A1 2017.01 A
A1 2017.02 B
A1 2017.04 C
A2 2017.02 A
A2 2017.03 D
A2 2017.05 C

Output:-

ID month   Name
A1 2017.01 A
A1 2017.02 B
A1 2017.03 B
A1 2017.04 C
A2 2017.02 A
A2 2017.03 D
A2 2017.04 D
A2 2017.05 C

I require to get the missing months in the sequence and the value of the month preceding it and which is present in the input list. Consider example of ID "A1". "A1" has months 1,2,4 and has missing month 3. So i need to add the row with value "A1" as ID, month as "2017.03" and Name as "B". Please note the "Name" column should get its value from the row immediately above it that is present in the input.

How do I achieve this in pandas, or by any other method in python.

Any help is appreciated! Thanks

Upvotes: 1

Views: 2344

Answers (2)

delimiter
delimiter

Reputation: 795

There was a question in the comments about how df knows which column to ffill, and I just decided to go through it and post it here, maybe someone finds it useful (or I use it as a reference for myself):

mytest = pd.DataFrame({'ID': ['A1', 'A1', 'A1', 'A2', 'A2', 'A2'], 'month': ['2017.01', '2017.02', '2017.04', '2017.02', '2017.03', '2017.05'], 'Name':['A','B','C','A','D','C']})
mytest.month = pd.to_datetime(mytest.month)
mytest=mytest.set_index('month').groupby(['ID'])
mytest = mytest.resample('MS').asfreq()['Name']
mytest = pd.DataFrame(pd.DataFrame(mytest).to_records())
mytest.Name = mytest.Name.ffill()
mytest

Obviously outputs a very similar thing, I just have not formatted months back to the original format.

    ID  month   Name
0   A1  2017-01-01  A
1   A1  2017-02-01  B
2   A1  2017-03-01  B
3   A1  2017-04-01  C
4   A2  2017-02-01  A
5   A2  2017-03-01  D
6   A2  2017-04-01  D
7   A2  2017-05-01  C

Upvotes: 0

Scott Boston
Scott Boston

Reputation: 153540

Let's try this with @EFT's suggestion:

df['Date'] = pd.to_datetime(df.month.astype(str),format='%Y.%m')
df_out = df.set_index('Date').groupby('ID').resample('MS').asfreq().ffill().reset_index(level=0, drop=True)
df_out = df_out.reset_index()
df_out['month'] = df_out.Date.dt.strftime('%Y.%m')
df_out = df_out.drop('Date',axis=1)
print(df_out)

Output:

   ID    month Name
0  A1  2017.01    A
1  A1  2017.02    B
2  A1  2017.03    B
3  A1  2017.04    C
4  A2  2017.02    A
5  A2  2017.03    D
6  A2  2017.04    D
7  A2  2017.05    C

Upvotes: 5

Related Questions