Reputation: 4896
I have the following setup:
I have sparse information about queries hitting my endpoint at certain timepoints in a csv file. I parse this csv file with dates according to date_format='ISO8601'
in the index column. Now what I want to do is this: I want to count the queries in certain intervals and put them into a dataframe that represents from start to enddate how many queries in said distinct intervals have hit the endpoint.
The problem is this: Using resample() I can aggregate and count the queries in the time intervals that contain information. But I can't find a way to extend this interval to always stretch from start to end date (with intervals filled with '0' by default).
I tried a combination of reindexing and resampling:
csv:
datetime,user,query
2024-03-02T00:00:00Z,user1,query1
2024-03-18T03:45:00Z,user1,query2
2024-03-31T12:01:00Z,user1,query3
myscript.py:
df = pd.read_csv(infile, sep=',', index_col='datetime', date_format='ISO8601', parse_dates=True)
df_timerange = df[start_date:end_date]
df_period = pd.date_range(start=start_date, end=end_date, freq='1M')
df_sampled = df_timerange['query'].resample('1M').count().fillna(0)
df_sampled = df_timerange.reindex(df_period)
However this will just produce a dataframe where index dates range from 2023-04-30T07:37:39.750Z
to 2024-03-31T07:37:39.750Z
in frequencies of 1 month, but the original data from the csv (df_timerange
) is somehow not represented (all values are NaN)... Also I wonder why the dates start at this weird time: 07:37:39.750
. My guess is that the reindexing didn't hit the timepoints where df_timerange
contains values so they are just skipped? Or the timezone generated by pd.date_range() is not ISO8601 and this causes a mismatch.. Again, I'm not too experienced with panda dataframes to make sense of it.
Minimal reproducible example:
Run this with python 3.11:
from datetime import datetime, timezone
import pandas as pd
start_date = datetime(2023, 4, 15, 4, 1, 40, tzinfo=timezone.utc)
end_date = datetime(2024, 4, 15, 0, 0, 0, tzinfo=timezone.utc)
df = pd.read_csv('test.csv', sep=',', index_col='datetime', date_format='ISO8601', parse_dates=True)
df_timerange = df[start_date:end_date]
df_period = pd.date_range(start=start_date, end=end_date, freq='1M')
df_sampled = df_timerange['query'].resample('1M').count().fillna(0)
df_sampled = df_timerange.reindex(df_period)
print(df_sampled)
Upvotes: 2
Views: 106
Reputation: 1
I used resample()
to count unique queries and date_range()
to generate the full interval of months for an arbitrary start and end date. These are the two main parts that we needed. Then resample the dataframe to make sure all dates in ragne are included, and then convert the date range to period range to make indexes consistent.
import pandas as pd
# some constant values to represent the timeframe
start_date = '2024-03-01'
end_date = '2025-03-31'
df = pd.read_csv('file.csv', sep=',', index_col='datetime', parse_dates=['datetime'])
df.index = df.index.tz_convert('UTC')
df_timerange = df[start_date:end_date]
df_sampled = df_timerange.resample('M').agg({'user': 'nunique', 'query': 'nunique'})
full_date_range = pd.date_range(start=start_date, end=end_date, freq='M')
df_sampled.index = df_sampled.index.to_period('M')
full_period_range = full_date_range.to_period('M')
df_sampled = df_sampled.reindex(full_period_range, fill_value='NaN')
print("Adjusted Monthly Data:")
print(df_sampled.head(20))
Also, it looks like the timestamps are in UTC. Is this the correct timezone that you were expecting? If not, you can convert it using df.index.tz_convert()
, which could explain the messed up timings. Let me know if I understood correctly!
Upvotes: 0