Sociopath
Sociopath

Reputation: 13426

Fill missing timestamps and apply different operations on different columns

I have a data in below format

user   timestamp             flowers     total_flowers

xyz   01-01-2020 00:05:00    15          15
xyz   01-01-2020 00:10:00    5           20
xyz   01-01-2020 00:15:00    21          41
xyz   01-01-2020 00:35:00    1           42

...

xyz   01-01-2020 11:45:00   57           1029
xyz   01-01-2020 11:55:00   18           1047

Expected Output:

user   timestamp             flowers     total_flowers

xyz   01-01-2020 00:05:00    15          15
xyz   01-01-2020 00:10:00    5           20
xyz   01-01-2020 00:15:00    21          41
xyz   01-01-2020 00:20:00    0           41
xyz   01-01-2020 00:25:00    0           41
xyz   01-01-2020 00:30:00    0           41
xyz   01-01-2020 00:35:00    1           42

...

xyz   01-01-2020 11:45:00   57           1029
xyz   01-01-2020 11:50:00   0            1029
xyz   01-01-2020 11:55:00   18           1047

So I want to fill timestamps with 5 mins interval and fill flowers column by 0 and total_flowers column by previous value(ffill)

My efforts:

start_day = "01-01-2020"
end_day = "01-01-2020"

start_time = pd.to_datetime(f"{start_day} 00:05:00+05:30")
end_time = pd.to_datetime(f"{end_day} 23:55:00+05:30")
dates = pd.date_range(start=start_time, end=end_time, freq='5Min')

df =  df.set_index('timestamp').reindex(dates).reset_index(drop=False).reindex(columns=df.columns)

How do I fill flowers column with zeros and total_flower column with ffill and I am also getting values in timestamp column as Nan

Actual Output:

user   timestamp flowers     total_flowers

xyz   Nan    15          15
xyz   Nan    5           20
xyz   Nan    21          41
xyz   Nan    Nan         Nan
xyz   Nan    Nan         Nan
xyz   Nan    Nan         Nan
xyz   Nan    1           42

...

xyz   Nan   57           1029
xyz   Nan   Nan          Nan
xyz   Nan   18           1047

Upvotes: 2

Views: 518

Answers (2)

Tom
Tom

Reputation: 8800

Reindex and refill

If you construct the dates such that you can reindex your timestamps, you can then just do some fillna and ffill operations. I had to remove the timezone information, but you should be able to add that back if your data are timezone aware. Here's the full example using some of your data:

d = {'user': {0: 'xyz', 1: 'xyz', 2: 'xyz', 3: 'xyz'},
 'timestamp': {0: Timestamp('2020-01-01 00:05:00'),
  1: Timestamp('2020-01-01 00:10:00'),
  2: Timestamp('2020-01-01 00:15:00'),
  3: Timestamp('2020-01-01 00:35:00')},
 'flowers': {0: 15, 1: 5, 2: 21, 3: 1},
 'total_flowers': {0: 15, 1: 20, 2: 41, 3: 42}}

df = pd.DataFrame(d)

#  user           timestamp  flowers  total_flowers
#0  xyz 2020-01-01 00:05:00       15             15
#1  xyz 2020-01-01 00:10:00        5             20
#2  xyz 2020-01-01 00:15:00       21             41
#3  xyz 2020-01-01 00:35:00        1             42

#as you did, but with no TZ
start_day = "01-01-2020"
end_day = "01-01-2020"

start_time = pd.to_datetime(f"{start_day} 00:05:00")
end_time = pd.to_datetime(f"{end_day} 00:55:00")
dates = pd.date_range(start=start_time, end=end_time, freq='5Min', name="timestamp")

#filling the nas and reformatting
df = df.set_index('timestamp')
df = df.reindex(dates)
df['user'].ffill(inplace=True)
df['flowers'].fillna(0, inplace=True)
df['total_flowers'].ffill(inplace=True)
df.reset_index(inplace=True)

Output:

             timestamp user  flowers  total_flowers
0  2020-01-01 00:05:00  xyz     15.0           15.0
1  2020-01-01 00:10:00  xyz      5.0           20.0
2  2020-01-01 00:15:00  xyz     21.0           41.0
3  2020-01-01 00:20:00  xyz      0.0           41.0
4  2020-01-01 00:25:00  xyz      0.0           41.0
5  2020-01-01 00:30:00  xyz      0.0           41.0
6  2020-01-01 00:35:00  xyz      1.0           42.0
7  2020-01-01 00:40:00  xyz      0.0           42.0
8  2020-01-01 00:45:00  xyz      0.0           42.0
9  2020-01-01 00:50:00  xyz      0.0           42.0
10 2020-01-01 00:55:00  xyz      0.0           42.0

Resample and refill

You can also use resample here using asfreq(), then do the filling as before. This is convenient for finding the dates (and should get around the timezone stuff):

# resample and then fill the gaps
# same df as constructed above
df = df.set_index('timestamp')
df.resample('5T').asfreq()

df['user'].ffill(inplace=True)
df['flowers'].fillna(0, inplace=True)
df['total_flowers'].ffill(inplace=True)
df.index.name='timestamp'
df.reset_index(inplace=True)

Same output:

            timestamp  flowers  total_flowers user
0 2020-01-01 00:05:00       15           15.0  xyz
1 2020-01-01 00:10:00        5           20.0  xyz
2 2020-01-01 00:15:00       21           41.0  xyz
3 2020-01-01 00:20:00        0           41.0  xyz
4 2020-01-01 00:25:00        0           41.0  xyz
5 2020-01-01 00:30:00        0           41.0  xyz
6 2020-01-01 00:35:00        1           42.0  xyz

I couldn't find a way to do the filling during the resampling. For instance, using

df = df.resample('5T').agg({'flowers':'sum',
                            'total_flowers':'ffill',
                            'user':'ffill'})

does not work (it gets you to the same place as asfreq, but there's more room for accidentally missing out columns here). Which is odd because when applying ffill over the whole DataFrame, the missing data can be forward filled (but we only want that for some columns, and the user column also gets dropped). But simply using asfreq and doing the filling after the fact seems fine to me with few columns.

Upvotes: 2

TiTo
TiTo

Reputation: 865

crossed with @Tom

You are almost there:

df = pd.DataFrame({'user': ['xyz', 'xyz', 'xyz', 'xyz'],
    'timestamp': ['01-01-2020 00:05:00', '01-01-2020 00:10:00', '01-01-2020 00:15:00', '01-01-2020 00:35:00'],
                   'flowers':[15, 5, 21, 1],
                   'total_flowers':[15, 20, 41, 42]
})
df['timestamp'] = pd.to_datetime(df['timestamp'])
r = pd.date_range(start=df['timestamp'].min(), end=df['timestamp'].max(), freq='5Min')
df = df.set_index('timestamp').reindex(r).rename_axis('timestamp').reset_index()
df['user'].ffill(inplace=True)
df['total_flowers'].ffill(inplace=True)
df['flowers'].fillna(0, inplace=True)

leads to the following output:

    timestamp           user    flowers total_flowers
0   2020-01-01 00:05:00 xyz     15.0    15.0
1   2020-01-01 00:10:00 xyz      5.0    20.0
2   2020-01-01 00:15:00 xyz     21.0    41.0
3   2020-01-01 00:20:00 xyz      0.0    41.0
4   2020-01-01 00:25:00 xyz      0.0    41.0
5   2020-01-01 00:30:00 xyz      0.0    41.0
6   2020-01-01 00:35:00 xyz      1.0    42.0

Upvotes: 1

Related Questions