Quinten
Quinten

Reputation: 41357

Fill missing dates hourly per group with previous value in certain column using Pandas

I have the following dataframe (sample):

import pandas as pd

data = [['A', '2022-09-01 10:00:00', False, 2], ['A', '2022-09-01 14:00:00', False, 3],
        ['B', '2022-09-01 13:00:00', False, 1], ['B', '2022-09-01 16:00:00', True, 4]]
df = pd.DataFrame(data = data, columns = ['group', 'date', 'indicator', 'value'])

  group                 date  indicator  value
0     A  2022-09-01 10:00:00      False      2
1     A  2022-09-01 14:00:00      False      3
2     B  2022-09-01 13:00:00      False      1
3     B  2022-09-01 16:00:00       True      4

I would like to fill in the missing dates between dates hourly. So each hour that is missing between dates should be filled and the values should be the same as the previous data. Here is the desired output:

data = [['A', '2022-09-01 10:00:00', False, 2], ['A', '2022-09-01 11:00:00', False, 2], 
        ['A', '2022-09-01 12:00:00', False, 2], ['A', '2022-09-01 13:00:00', False, 2], 
        ['A', '2022-09-01 14:00:00', False, 3],
        ['B', '2022-09-01 13:00:00', False, 1], ['B', '2022-09-01 14:00:00', False, 1],
        ['B', '2022-09-01 15:00:00', False, 1], ['B', '2022-09-01 16:00:00', True, 4]]
df_desired = pd.DataFrame(data = data, columns = ['group', 'date', 'indicator', 'value'])

  group                 date  indicator  value
0     A  2022-09-01 10:00:00      False      2
1     A  2022-09-01 11:00:00      False      2
2     A  2022-09-01 12:00:00      False      2
3     A  2022-09-01 13:00:00      False      2
4     A  2022-09-01 14:00:00      False      3
5     B  2022-09-01 13:00:00      False      1
6     B  2022-09-01 14:00:00      False      1
7     B  2022-09-01 15:00:00      False      1
8     B  2022-09-01 16:00:00       True      4

So I was wondering if it is possible to fill the missing dates hourly per group with the previous value in column value using Pandas?

Upvotes: 2

Views: 83

Answers (3)

sammywemmy
sammywemmy

Reputation: 28699

One option is with complete from pyjanitor, to expose missing rows:

# pip install pyjanitor
import pandas as pd
import janitor
df['date'] = pd.to_datetime(df['date'])

# build a dictionary to contain the new dates
# the key of the dictionary must exist in the dataframe
new_date = {'date': lambda date: pd.date_range(date.min(), date.max(), freq='H')}

df.complete(new_date, by = 'group').ffill(downcast='infer') 
  group                date  indicator  value
0     A 2022-09-01 10:00:00      False      2
1     A 2022-09-01 11:00:00      False      2
2     A 2022-09-01 12:00:00      False      2
3     A 2022-09-01 13:00:00      False      2
4     A 2022-09-01 14:00:00      False      3
5     B 2022-09-01 13:00:00      False      1
6     B 2022-09-01 14:00:00      False      1
7     B 2022-09-01 15:00:00      False      1
8     B 2022-09-01 16:00:00       True      4

Upvotes: 1

Naveed
Naveed

Reputation: 11650

here is one other way about it

df['date']=pd.to_datetime(df['date'])

df2=(df.set_index('date' )
     .groupby('group',  group_keys=False)
     .apply(lambda x: x.resample('1H').ffill()) 
     .reset_index() )
df2
                   date     group   indicator   value
0   2022-09-01 10:00:00        A    False       2
1   2022-09-01 11:00:00        A    False       2
2   2022-09-01 12:00:00        A    False       2
3   2022-09-01 13:00:00        A    False       2
4   2022-09-01 14:00:00        A    False       3
5   2022-09-01 13:00:00        B    False       1
6   2022-09-01 14:00:00        B    False       1
7   2022-09-01 15:00:00        B    False       1
8   2022-09-01 16:00:00        B    True        4

Upvotes: 3

mozway
mozway

Reputation: 260890

You can use:

df['date'] = pd.to_datetime(df['date'])

out = (df
   .groupby('group', as_index=False, group_keys=False)
   .apply(lambda g: g.set_index('date')
                     .reindex(pd.date_range(g['date'].min(),
                                            g['date'].max(),
                                            freq='H'))
                     .ffill(downcast='infer').reset_index()
         
         )
   .reset_index(drop=True)
)

output:

                index group  indicator  value
0 2022-09-01 10:00:00     A      False      2
1 2022-09-01 11:00:00     A      False      2
2 2022-09-01 12:00:00     A      False      2
3 2022-09-01 13:00:00     A      False      2
4 2022-09-01 14:00:00     A      False      3
5 2022-09-01 13:00:00     B      False      1
6 2022-09-01 14:00:00     B      False      1
7 2022-09-01 15:00:00     B      False      1
8 2022-09-01 16:00:00     B       True      4

Upvotes: 2

Related Questions