Gregory Saxton
Gregory Saxton

Reputation: 1311

Cumulative Set in PANDAS

I have a dataframe of tweets and I'm looking to group the dataframe by date and generate a column that contains a cumulative list of all the unique users who have posted up to that date. None of the existing functions (e.g., cumsum) would appear to work for this. Here's a sample of the original tweet dataframe, where the index (created_at) is in datetime format:

In [3]: df
Out[3]: 
            screen_name 
created_at  
04-01-16    Bob 
04-01-16    Bob
04-01-16    Sally
04-01-16    Sally
04-02-16    Bob
04-02-16    Miguel
04-02-16    Tim

I can collapse the dataset by date and get a column with the unique users per day:

In [4]: df[['screen_name']].groupby(df.index.date).aggregate(lambda x: set(list(x)))

Out[4]:             from_user_screen_name
        2016-04-02  {Bob, Sally}
        2016-04-03  {Bob, Miguel, Tim}

So far so good. But what I'd like is to have a "cumulative set" like this:

Out[4]:             Cumulative_list_up_to_this_date   Cumulative_number_of_unique_users
        2016-04-02  {Bob, Sally}                      2
        2016-04-03  {Bob, Sally, Miguel, Tim}         4

Ultimately, what I am really interested in is the cumulative number in the last column so I can plot it. I've considered looping over dates and other things but can't seem to find a good way. Thanks in advance for any help.

Upvotes: 6

Views: 2731

Answers (3)

Pierre D
Pierre D

Reputation: 26221

The answer of @MykolaZotko is the only way that seems to work nowadays (in my case, pandas=2.2.1) in a groupby context. Furthermore, it can be applied directly to set.union in the accumulate function (here for a different goal than the OP's: do a cumulative union for each day):

from itertools import accumulate
from functools import partial


df = pd.DataFrame({
    'created_at': pd.to_datetime([
        '2016-04-01', '2016-04-01', '2016-04-01', '2016-04-01',
        '2016-04-02', '2016-04-02', '2016-04-02']),
    'screen_name': ['Bob', 'Bob', 'Sally', 'Sally', 'Bob', 'Miguel', 'Tim'],
}).set_index('created_at')


cum_union = partial(accumulate, func=set.union)
z = df.assign(
    cumset=df['screen_name']
    .apply(lambda x: {x})
    .groupby(pd.Grouper(freq='D'))
    .transform(cum_union)
)

>>> z
           screen_name              cumset
created_at                                
2016-04-01         Bob               {Bob}
2016-04-01         Bob               {Bob}
2016-04-01       Sally        {Bob, Sally}
2016-04-01       Sally        {Bob, Sally}
2016-04-02         Bob               {Bob}
2016-04-02      Miguel       {Bob, Miguel}
2016-04-02         Tim  {Bob, Tim, Miguel}

Upvotes: 0

Mykola Zotko
Mykola Zotko

Reputation: 17834

You can use the accumulate function as a parameter to transfrom:

from itertools import accumulate

df = pd.DataFrame({'col1': [1, 1, 1, 2, 2, 2, 2], 'col2': [1, 2, 3, 1, 2, 3, 4]})

df['col3'] = df['col2'].apply(lambda x: [x])
df['col3'] = df.groupby('col1')['col3'].transform(accumulate)

Result:

   col1  col2          col3
0     1     1           [1]
1     1     2        [1, 2]
2     1     3     [1, 2, 3]
3     2     1           [1]
4     2     2        [1, 2]
5     2     3     [1, 2, 3]
6     2     4  [1, 2, 3, 4]

To convert 'col3' to set use df['col3'].apply(set).

Upvotes: 1

Alicia Garcia-Raboso
Alicia Garcia-Raboso

Reputation: 13913

With recent version of Pandas, this does not work anymore and results in the following error: NotImplementedError: function is not implemented for this dtype: [how->cumsum,dtype->object]

You cannot add sets, but can add lists! So build a list of users, then take the cumulative sum and finally apply the set constructor to get rid of duplicates.

cum_names = (df['screen_name'].groupby(df.index.date)
                              .agg(lambda x: list(x))
                              .cumsum()
                              .apply(set))
# 2016-04-01                 {Bob, Sally}
# 2016-04-02    {Bob, Miguel, Tim, Sally}
# dtype: object

cum_count = cum_names.apply(len)
# 2016-04-01    2
# 2016-04-02    4
# dtype: int64

Upvotes: 9

Related Questions