Alex Bausk
Alex Bausk

Reputation: 718

Efficient data structure to keep dictionary-like observations keyed by time in Pandas?

I have a list of observations tied to a datetime that could be represented either by a dict or by an object, something like this:

Datetime         | Data
--------------------------------------------------------
2018-01-20 20:00 | {'word1': 3, 'word2': 5, 'wordX': 1}
2018-01-20 20:01 | {'word1': 2, 'word2': 2, 'wordX': 5}
2018-01-20 20:05 | {'word3': 2, 'word6': 1, 'word1': 1}

The amount of observations can be in the order of millions. My main problem, as I see it, is that the amount of keys in the Data object varies.

I want to be able to aggregate the weighted words by various time intervals (for example, each 10 minutes, each 10 hours, etc.) by calculating the sum of weights assigned to specific words for those intervals.

I want to use Pandas for this but I'm at a loss of which data structure to choose since my date key is not evenly spaced and can be non-uniqie.

My first idea was to keep a separate dict of the dictionary objects, keyed by a UUID, and have a Pandas timeseries hold the mapping of the datetime to those UUIDs. That way, I could get the per-interval buckets of UUIDs and then manually iterate over them with my custom aggregation. Is this as bad an idea as it sounds? I'd really appreciate a kick in the right direction.

Upvotes: 1

Views: 88

Answers (1)

cs95
cs95

Reputation: 402303

Convert your Datetime column to datetime, if it isn't already:

df.Datetime = pd.to_datetime(df.Datetime)

Since you have a dictionary of counts, you can convert each one to a Counter object, because these support __add__ition and will work well with resample.

Set Datetime to the index, resample, and find the sum.

from collections import Counter
r = df.set_index('Datetime').Data.apply(Counter).resample('10min').sum()

print(r)

2018-01-20 20:00:00    {'word1': 6, 'word2': 7, 'wordX': 6, 'word3': ...
Freq: 10T, Name: Data, dtype: object

The given example resamples data every 10 minutes. This works only due to the nature of your data, having counts as the values. If it isn't, you'll need to write your own loop and get it done.

Upvotes: 2

Related Questions