Reputation: 640
I have a DataFrame with measurements of the following form:
label
2015-01-17 20:58:00.740000 cc
2015-01-19 04:36:00.740000 xy
2015-01-19 09:48:00.740000 ab
2015-01-19 09:52:00.740000 ab
2015-01-20 11:45:00.740000 ab
I want to resample it by day, add a new column with the counts, and aggregate the labels into a list, so that I get the following result:
counts label
2015-01-17 1 [cc]
2015-01-18 0 []
2015-01-19 3 [ab, xy]
2015-01-20 1 [ab]
I'm new to pandas and don't know how to do this. I have read that a DataFrame can hold lists as column values. I can count the rows per day with DataFrame.resample(), and with sum I can concatenate the labels into one string, but that is not sufficient to produce the result above.
I have generated the data with
from datetime import datetime, timedelta
from pandas import DataFrame
from random import randint, choice
n = 5
rnd_time = lambda: datetime.now() + timedelta(days=randint(0, 3), hours=randint(0, 24))
rnd_label = lambda: choice(['ab', 'cc', 'xyz'])
gen_times = [rnd_time() for _ in range(n)]
gen_labels = [rnd_label() for _ in range(n)]
df = DataFrame({'label': gen_labels}, index=gen_times)
So how can one produce the desired outcome?
Thank you in advance.
Upvotes: 2
Views: 1276
Reputation: 40973
You can do:
>>> daily = df['label'].resample('D')  # one bin per calendar day
>>> DataFrame({'counts': daily.count(),
...            'label': daily.agg(lambda x: list(set(x)))})
counts label
2015-01-17 1 [cc]
2015-01-18 0 []
2015-01-19 3 [xy, ab]
2015-01-20 1 [ab]
EDIT: If the order of the elements matters, replace list(set(x)) with list(OrderedDict.fromkeys(x)) (OrderedDict lives in the collections module).
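For newer pandas releases, where resample() no longer accepts a how argument, the same idea can be sketched roughly as below. Treat it as a sketch under assumptions rather than the definitive way: the rows are copied by hand from the question, unique_in_order is a hypothetical helper name, and dict.fromkeys stands in for OrderedDict.fromkeys (plain dicts keep insertion order on Python 3.7+).

```python
import pandas as pd

# Sample rows copied from the question (not randomly generated).
df = pd.DataFrame(
    {"label": ["cc", "xy", "ab", "ab", "ab"]},
    index=pd.to_datetime([
        "2015-01-17 20:58:00.740",
        "2015-01-19 04:36:00.740",
        "2015-01-19 09:48:00.740",
        "2015-01-19 09:52:00.740",
        "2015-01-20 11:45:00.740",
    ]),
)


def unique_in_order(labels):
    """Hypothetical helper: drop duplicates, keeping first-seen order."""
    return list(dict.fromkeys(labels))


# Bin the label column by calendar day; days without rows become empty bins.
daily = df["label"].resample("D")

result = pd.DataFrame({
    "counts": daily.count(),             # number of rows in each day
    "label": daily.agg(unique_in_order), # distinct labels in each day
})
print(result)
```

On the sample rows this should give counts of 1, 0, 3 and 1 for 2015-01-17 through 2015-01-20, with an empty list for the day that has no measurements.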
Upvotes: 3