How to resample a DataFrame so it spans the full range from start date to end date in fixed intervals (with 0 for intervals that have no data)

Question

I have the following setup:

I have sparse information about queries hitting my endpoint at certain points in time, stored in a CSV file. I parse this CSV file with the dates in the index column, using date_format='ISO8601'. What I want to do is count the queries per interval and put the counts into a DataFrame that shows, for every interval from the start date to the end date, how many queries hit the endpoint.

The problem is this: using resample() I can aggregate and count the queries in the time intervals that contain data, but I can't find a way to extend the result so that it always stretches from the start date to the end date, with intervals that have no data filled with 0 by default.

I tried a combination of reindexing and resampling:

csv:

datetime,user,query
2024-03-02T00:00:00Z,user1,query1
2024-03-18T03:45:00Z,user1,query2
2024-03-31T12:01:00Z,user1,query3

myscript.py:

df = pd.read_csv(infile, sep=',', index_col='datetime', date_format='ISO8601', parse_dates=True)
df_timerange = df[start_date:end_date]
df_period = pd.date_range(start=start_date, end=end_date, freq='1M')
df_sampled = df_timerange['query'].resample('1M').count().fillna(0)
df_sampled = df_timerange.reindex(df_period)

However, this just produces a DataFrame whose index ranges from 2023-04-30T07:37:39.750Z to 2024-03-31T07:37:39.750Z in steps of one month, but the original data from the CSV (df_timerange) is not represented: all values are NaN. I also wonder why the dates start at this odd time of day, 07:37:39.750. My guess is that the reindexing didn't hit the timestamps where df_timerange contains values, so they were skipped. Or perhaps the timezone generated by pd.date_range() is not ISO8601 and this causes a mismatch. I'm not experienced enough with pandas DataFrames to make sense of it.
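
To illustrate the guess about mismatched timestamps, here is a minimal sketch of how pd.date_range() seems to carry the time of day of start over to every generated month end unless normalize=True is passed. The names month_end, raw and normalized are only for illustration, and pd.offsets.MonthEnd() is used in place of the '1M' alias as an assumption, to avoid the deprecation of 'M' in newer pandas versions.

```python
from datetime import datetime, timezone
import pandas as pd

start_date = datetime(2023, 4, 15, 4, 1, 40, tzinfo=timezone.utc)
end_date = datetime(2024, 4, 15, 0, 0, 0, tzinfo=timezone.utc)

# Offset object instead of the '1M' alias (assumption: avoids alias deprecation).
month_end = pd.offsets.MonthEnd()

# Without normalize, each month end keeps the time of day of start_date.
raw = pd.date_range(start=start_date, end=end_date, freq=month_end)

# With normalize=True, start and end are set to midnight first,
# so every generated month end falls on 00:00:00.
normalized = pd.date_range(start=start_date, end=end_date, freq=month_end, normalize=True)

print(raw[0])
print(normalized[0])
```

Since resample() labels monthly bins at midnight of the month end, only the normalized index would line up with the resampled result.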

Minimal reproducible example:

Run this with Python 3.11 (test.csv contains the CSV data shown above):

from datetime import datetime, timezone
import pandas as pd

start_date = datetime(2023, 4, 15, 4, 1, 40, tzinfo=timezone.utc)
end_date = datetime(2024, 4, 15, 0, 0, 0, tzinfo=timezone.utc)

df = pd.read_csv('test.csv', sep=',', index_col='datetime', date_format='ISO8601', parse_dates=True)
df_timerange = df[start_date:end_date]
df_period = pd.date_range(start=start_date, end=end_date, freq='1M')

df_sampled = df_timerange['query'].resample('1M').count().fillna(0)
df_sampled = df_timerange.reindex(df_period)
print(df_sampled)
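
For comparison, a hedged sketch of one way the desired result might be obtained: resample first, then reindex the resampled counts (not the raw data) onto a normalized monthly range with fill_value=0. This is an assumption about what would work, not a confirmed answer. The CSV is inlined via StringIO so the snippet is self-contained, the index is parsed with pd.to_datetime(..., utc=True) instead of date_format='ISO8601', and the names csv_text, month_end, counts, full_index and result are illustrative.

```python
from datetime import datetime, timezone
from io import StringIO
import pandas as pd

csv_text = """datetime,user,query
2024-03-02T00:00:00Z,user1,query1
2024-03-18T03:45:00Z,user1,query2
2024-03-31T12:01:00Z,user1,query3
"""

start_date = datetime(2023, 4, 15, 4, 1, 40, tzinfo=timezone.utc)
end_date = datetime(2024, 4, 15, 0, 0, 0, tzinfo=timezone.utc)

# Read the index as strings, then parse it into a UTC-aware DatetimeIndex.
df = pd.read_csv(StringIO(csv_text), sep=',', index_col='datetime')
df.index = pd.to_datetime(df.index, utc=True)

df_timerange = df.loc[start_date:end_date]

# Offset object instead of the '1M' alias (assumption: avoids alias deprecation).
month_end = pd.offsets.MonthEnd()

# Count queries per month; only months that contain data show up here.
counts = df_timerange['query'].resample(month_end).count()

# Month-end labels at midnight for the whole requested range.
full_index = pd.date_range(start=start_date, end=end_date, freq=month_end, normalize=True)

# Reindex the resampled counts and fill months without data with 0.
result = counts.reindex(full_index, fill_value=0)
print(result)
```

Note that with these dates the range ends at the 2024-03-31 label, because the month end 2024-04-30 lies after end_date; whether the partial months at both ends should be included is a separate design choice.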
