Reputation: 13426
I have a data in below format
user timestamp flowers total_flowers
xyz 01-01-2020 00:05:00 15 15
xyz 01-01-2020 00:10:00 5 20
xyz 01-01-2020 00:15:00 21 41
xyz 01-01-2020 00:35:00 1 42
...
xyz 01-01-2020 11:45:00 57 1029
xyz 01-01-2020 11:55:00 18 1047
Expected Output:
user timestamp flowers total_flowers
xyz 01-01-2020 00:05:00 15 15
xyz 01-01-2020 00:10:00 5 20
xyz 01-01-2020 00:15:00 21 41
xyz 01-01-2020 00:20:00 0 41
xyz 01-01-2020 00:25:00 0 41
xyz 01-01-2020 00:30:00 0 41
xyz 01-01-2020 00:35:00 1 42
...
xyz 01-01-2020 11:45:00 57 1029
xyz 01-01-2020 11:50:00 0 1029
xyz 01-01-2020 11:55:00 18 1047
So I want to fill timestamps with 5 mins interval and fill flowers column by 0 and total_flowers column by previous value(ffill)
My efforts:
start_day = "01-01-2020"
end_day = "01-01-2020"
start_time = pd.to_datetime(f"{start_day} 00:05:00+05:30")
end_time = pd.to_datetime(f"{end_day} 23:55:00+05:30")
dates = pd.date_range(start=start_time, end=end_time, freq='5Min')
df = df.set_index('timestamp').reindex(dates).reset_index(drop=False).reindex(columns=df.columns)
How do I fill flowers
column with zeros and total_flower
column with ffill
and I am also getting values in timestamp
column as Nan
Actual Output:
user timestamp flowers total_flowers
xyz Nan 15 15
xyz Nan 5 20
xyz Nan 21 41
xyz Nan Nan Nan
xyz Nan Nan Nan
xyz Nan Nan Nan
xyz Nan 1 42
...
xyz Nan 57 1029
xyz Nan Nan Nan
xyz Nan 18 1047
Upvotes: 2
Views: 518
Reputation: 8800
If you construct the dates
such that you can reindex
your timestamps, you can then just do some fillna
and ffill
operations. I had to remove the timezone information, but you should be able to add that back if your data are timezone aware. Here's the full example using some of your data:
d = {'user': {0: 'xyz', 1: 'xyz', 2: 'xyz', 3: 'xyz'},
'timestamp': {0: Timestamp('2020-01-01 00:05:00'),
1: Timestamp('2020-01-01 00:10:00'),
2: Timestamp('2020-01-01 00:15:00'),
3: Timestamp('2020-01-01 00:35:00')},
'flowers': {0: 15, 1: 5, 2: 21, 3: 1},
'total_flowers': {0: 15, 1: 20, 2: 41, 3: 42}}
df = pd.DataFrame(d)
# user timestamp flowers total_flowers
#0 xyz 2020-01-01 00:05:00 15 15
#1 xyz 2020-01-01 00:10:00 5 20
#2 xyz 2020-01-01 00:15:00 21 41
#3 xyz 2020-01-01 00:35:00 1 42
#as you did, but with no TZ
start_day = "01-01-2020"
end_day = "01-01-2020"
start_time = pd.to_datetime(f"{start_day} 00:05:00")
end_time = pd.to_datetime(f"{end_day} 00:55:00")
dates = pd.date_range(start=start_time, end=end_time, freq='5Min', name="timestamp")
#filling the nas and reformatting
df = df.set_index('timestamp')
df = df.reindex(dates)
df['user'].ffill(inplace=True)
df['flowers'].fillna(0, inplace=True)
df['total_flowers'].ffill(inplace=True)
df.reset_index(inplace=True)
Output:
timestamp user flowers total_flowers
0 2020-01-01 00:05:00 xyz 15.0 15.0
1 2020-01-01 00:10:00 xyz 5.0 20.0
2 2020-01-01 00:15:00 xyz 21.0 41.0
3 2020-01-01 00:20:00 xyz 0.0 41.0
4 2020-01-01 00:25:00 xyz 0.0 41.0
5 2020-01-01 00:30:00 xyz 0.0 41.0
6 2020-01-01 00:35:00 xyz 1.0 42.0
7 2020-01-01 00:40:00 xyz 0.0 42.0
8 2020-01-01 00:45:00 xyz 0.0 42.0
9 2020-01-01 00:50:00 xyz 0.0 42.0
10 2020-01-01 00:55:00 xyz 0.0 42.0
You can also use resample
here using asfreq()
, then do the filling as before. This is convenient for finding the dates (and should get around the timezone stuff):
# resample and then fill the gaps
# same df as constructed above
df = df.set_index('timestamp')
df.resample('5T').asfreq()
df['user'].ffill(inplace=True)
df['flowers'].fillna(0, inplace=True)
df['total_flowers'].ffill(inplace=True)
df.index.name='timestamp'
df.reset_index(inplace=True)
Same output:
timestamp flowers total_flowers user
0 2020-01-01 00:05:00 15 15.0 xyz
1 2020-01-01 00:10:00 5 20.0 xyz
2 2020-01-01 00:15:00 21 41.0 xyz
3 2020-01-01 00:20:00 0 41.0 xyz
4 2020-01-01 00:25:00 0 41.0 xyz
5 2020-01-01 00:30:00 0 41.0 xyz
6 2020-01-01 00:35:00 1 42.0 xyz
I couldn't find a way to do the filling during the resampling. For instance, using
df = df.resample('5T').agg({'flowers':'sum',
'total_flowers':'ffill',
'user':'ffill'})
does not work (it gets you to the same place as asfreq
, but there's more room for accidentally missing out columns here). Which is odd because when applying ffill
over the whole DataFrame, the missing data can be forward filled (but we only want that for some columns, and the user column also gets dropped). But simply using asfreq
and doing the filling after the fact seems fine to me with few columns.
Upvotes: 2
Reputation: 865
You are almost there:
df = pd.DataFrame({'user': ['xyz', 'xyz', 'xyz', 'xyz'],
'timestamp': ['01-01-2020 00:05:00', '01-01-2020 00:10:00', '01-01-2020 00:15:00', '01-01-2020 00:35:00'],
'flowers':[15, 5, 21, 1],
'total_flowers':[15, 20, 41, 42]
})
df['timestamp'] = pd.to_datetime(df['timestamp'])
r = pd.date_range(start=df['timestamp'].min(), end=df['timestamp'].max(), freq='5Min')
df = df.set_index('timestamp').reindex(r).rename_axis('timestamp').reset_index()
df['user'].ffill(inplace=True)
df['total_flowers'].ffill(inplace=True)
df['flowers'].fillna(0, inplace=True)
leads to the following output:
timestamp user flowers total_flowers
0 2020-01-01 00:05:00 xyz 15.0 15.0
1 2020-01-01 00:10:00 xyz 5.0 20.0
2 2020-01-01 00:15:00 xyz 21.0 41.0
3 2020-01-01 00:20:00 xyz 0.0 41.0
4 2020-01-01 00:25:00 xyz 0.0 41.0
5 2020-01-01 00:30:00 xyz 0.0 41.0
6 2020-01-01 00:35:00 xyz 1.0 42.0
Upvotes: 1