Reputation: 718
I have a list of observations tied to a datetime that could be represented either by a dict or by an object, something like this:
Datetime | Data
--------------------------------------------------------
2018-01-20 20:00 | {'word1': 3, 'word2': 5, 'wordX': 1}
2018-01-20 20:01 | {'word1': 2, 'word2': 2, 'wordX': 5}
2018-01-20 20:05 | {'word3': 2, 'word6': 1, 'word1': 1}
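For reference, here's a minimal sketch of how I'd load this sample into a pandas DataFrame (the column names Datetime and Data are just placeholders I use here):

import pandas as pd

# Toy sample mirroring the table above; the real data has millions of rows.
df = pd.DataFrame({
    'Datetime': ['2018-01-20 20:00', '2018-01-20 20:01', '2018-01-20 20:05'],
    'Data': [
        {'word1': 3, 'word2': 5, 'wordX': 1},
        {'word1': 2, 'word2': 2, 'wordX': 5},
        {'word3': 2, 'word6': 1, 'word1': 1},
    ],
})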
The number of observations can be on the order of millions. My main problem, as I see it, is that the number of keys in the Data object varies.
I want to be able to aggregate the weighted words over various time intervals (for example, every 10 minutes, every 10 hours, etc.) by calculating the sum of the weights assigned to each word within those intervals.
I want to use Pandas for this, but I'm at a loss as to which data structure to choose, since my date key is not evenly spaced and can be non-unique.
My first idea was to keep a separate dict of the dictionary objects, keyed by a UUID, and have a Pandas timeseries hold the mapping of the datetime to those UUIDs. That way, I could get the per-interval buckets of UUIDs and then manually iterate over them with my custom aggregation. Is this as bad an idea as it sounds? I'd really appreciate a kick in the right direction.
Upvotes: 1
Views: 88
Reputation: 402303
Convert your Datetime column to datetime, if it isn't already:
df.Datetime = pd.to_datetime(df.Datetime)
Since you have a dictionary of counts, you can convert each one to a Counter object, because Counters support addition (via __add__) and so will work well with resample.
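As a quick illustration of that addition behaviour (a standalone sketch, not part of the pipeline):

from collections import Counter

# Adding two Counters merges their keys and sums overlapping counts.
a = Counter({'word1': 3, 'word2': 5, 'wordX': 1})
b = Counter({'word1': 2, 'word2': 2, 'wordX': 5})
print(a + b)  # Counter({'word2': 7, 'wordX': 6, 'word1': 5})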
Set Datetime as the index, resample, and take the sum:
from collections import Counter
r = df.set_index('Datetime').Data.apply(Counter).resample('10min').sum()
print(r)
2018-01-20 20:00:00 {'word1': 6, 'word2': 7, 'wordX': 6, 'word3': ...
Freq: 10T, Name: Data, dtype: object
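If you'd rather have one column per word, the Series of Counters can be expanded into a DataFrame; a minimal sketch, assuming every resampled bucket is non-empty (an empty bucket sums to 0 rather than a Counter):

# Expand each Counter into columns; missing words become NaN, filled with 0.
wide = pd.DataFrame(r.tolist(), index=r.index).fillna(0).astype(int)
print(wide)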
The example above resamples the data every 10 minutes. This only works because of the nature of your data, with counts as the values; Counter objects know how to add themselves together. If that isn't the case, you'll need to write your own aggregation loop.
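A sketch of what that manual loop could look like, grouping on a pd.Grouper instead of resample (the merge logic here is my own illustration, and assumes the df with Datetime/Data columns from the question):

import pandas as pd

# Group rows into 10-minute buckets and merge each bucket's dicts by hand.
for bucket, group in df.groupby(pd.Grouper(key='Datetime', freq='10min')):
    totals = {}
    for d in group['Data']:
        for word, weight in d.items():
            totals[word] = totals.get(word, 0) + weight
    print(bucket, totals)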
Upvotes: 2