Ludovica
Ludovica

Reputation: 65

Pandas aggregate list in resample/groupby

I have a dataframe in which each instance has a timestamp, an id and a list of numbers as follows:

timestamp           | id | lists
----------------------------------
2016-01-01 00:00:00 | 1  | [2, 10]
2016-01-01 05:00:00 | 1  | [9, 10, 3, 5]
2016-01-01 10:00:00 | 1  | [1, 10, 5]
2016-01-02 01:00:00 | 1  | [2, 6, 7]
2016-01-02 04:00:00 | 1  | [2, 6]
2016-01-01 02:00:00 | 2  | [0]
2016-01-01 08:00:00 | 2  | [10, 3, 2]
2016-01-01 14:00:00 | 2  | [0, 9, 3]
2016-01-02 03:00:00 | 2  | [0, 9, 2]

For each id I want to resample by day(and this is easy) and concatenate all the lists of the instances that happened in the same day. Resample + concat/sum does not work because resample removes all non-numeric columns (see here)

I want to write something similar to this:

daily_data = data.groupby('id').resample('1D').concatenate() # .concatenate() does not exist

Result desired:

timestamp  | id | lists
----------------------------------
2016-01-01 | 1  | [2, 10, 9, 10, 3, 5, 1, 10, 5]
2016-01-02 | 1  | [2, 6, 7, 2, 6]
2016-01-01 | 2  | [0, 10, 3, 2]
2016-01-02 | 2  | [0, 9, 3, 0, 9, 2]

Here you can copy a script that generates the input I used for the description:

import pandas as pd 
from random import randint

time = pd.to_datetime( ['2016-01-01 00:00:00', '2016-01-01 05:00:00', 
                        '2016-01-01 10:00:00', '2016-01-02 01:00:00', 
                        '2016-01-02 04:00:00', '2016-01-01 02:00:00', 
                        '2016-01-01 08:00:00', '2016-01-01 14:00:00',
                        '2016-01-02 03:00:00' ]
                      )

id_1 = [1] * 5
id_2 = [2] * 4

lists = [0] * 9
for i in range(9):
    l = [randint(0,10)  for _ in range(randint(1,5) ) ]
    l = list(set(l))
    lists[i] = l

data = {'timestamp': time, 'id': id_1 + id_2, 'lists': lists}

example = pd.DataFrame(data=data)

Bonus points if there is a way to optionally remove duplicates in the concatenated list.

Upvotes: 3

Views: 1492

Answers (2)

NickBraunagel
NickBraunagel

Reputation: 1599

for the unique count of each list item use list comprehension:

a = [list(set(l)) for l in df.lists]
df.loc[:,'lists'] = a

Upvotes: 0

piRSquared
piRSquared

Reputation: 294258

As pointed out by @jezrael, this only works in pandas version 0.18.1+

  • set_index with 'timestamp' to prep for later resample
  • groupby 'id' column and select lists columns
  • after resample, sum of lists will concatenate them
  • reset_index to get columns in correct order

df.set_index('timestamp').groupby('id').lists.resample('D').sum() \
  .reset_index('id').reset_index()

enter image description here

Upvotes: 7

Related Questions