Reputation: 1311
I have a dataframe of tweets and I'm looking to group the dataframe by date and generate a column that contains a cumulative list of all the unique users who have posted up to that date. None of the existing functions (e.g., cumsum) would appear to work for this. Here's a sample of the original tweet dataframe, where the index (created_at) is in datetime format:
In [3]: df
Out[3]:
screen_name
created_at
04-01-16 Bob
04-01-16 Bob
04-01-16 Sally
04-01-16 Sally
04-02-16 Bob
04-02-16 Miguel
04-02-16 Tim
I can collapse the dataset by date and get a column with the unique users per day:
In [4]: df[['screen_name']].groupby(df.index.date).aggregate(lambda x: set(list(x)))
Out[4]: from_user_screen_name
2016-04-02 {Bob, Sally}
2016-04-03 {Bob, Miguel, Tim}
So far so good. But what I'd like is to have a "cumulative set" like this:
Out[4]: Cumulative_list_up_to_this_date Cumulative_number_of_unique_users
2016-04-02 {Bob, Sally} 2
2016-04-03 {Bob, Sally, Miguel, Tim} 4
Ultimately, what I am really interested in is the cumulative number in the last column so I can plot it. I've considered looping over dates and other things but can't seem to find a good way. Thanks in advance for any help.
Upvotes: 6
Views: 2731
Reputation: 26221
The answer of @MykolaZotko is the only way that seems to work nowadays (in my case, pandas=2.2.1
) in a groupby
context. Furthermore, it can be applied directly to set.union
in the accumulate
function (here for a different goal than the OP's: do a cumulative union for each day):
from itertools import accumulate
from functools import partial
df = pd.DataFrame({
'created_at': pd.to_datetime([
'2016-04-01', '2016-04-01', '2016-04-01', '2016-04-01',
'2016-04-02', '2016-04-02', '2016-04-02']),
'screen_name': ['Bob', 'Bob', 'Sally', 'Sally', 'Bob', 'Miguel', 'Tim'],
}).set_index('created_at')
cum_union = partial(accumulate, func=set.union)
z = df.assign(
cumset=df['screen_name']
.apply(lambda x: {x})
.groupby(pd.Grouper(freq='D'))
.transform(cum_union)
)
>>> z
screen_name cumset
created_at
2016-04-01 Bob {Bob}
2016-04-01 Bob {Bob}
2016-04-01 Sally {Bob, Sally}
2016-04-01 Sally {Bob, Sally}
2016-04-02 Bob {Bob}
2016-04-02 Miguel {Bob, Miguel}
2016-04-02 Tim {Bob, Tim, Miguel}
Upvotes: 0
Reputation: 17834
You can use the accumulate
function as a parameter to transfrom
:
from itertools import accumulate
df = pd.DataFrame({'col1': [1, 1, 1, 2, 2, 2, 2], 'col2': [1, 2, 3, 1, 2, 3, 4]})
df['col3'] = df['col2'].apply(lambda x: [x])
df['col3'] = df.groupby('col1')['col3'].transform(accumulate)
Result:
col1 col2 col3
0 1 1 [1]
1 1 2 [1, 2]
2 1 3 [1, 2, 3]
3 2 1 [1]
4 2 2 [1, 2]
5 2 3 [1, 2, 3]
6 2 4 [1, 2, 3, 4]
To convert 'col3'
to set use df['col3'].apply(set)
.
Upvotes: 1
Reputation: 13913
With recent version of Pandas, this does not work anymore and results in the following error: NotImplementedError: function is not implemented for this dtype: [how->cumsum,dtype->object]
You cannot add sets, but can add lists! So build a list of users, then take the cumulative sum and finally apply the set constructor to get rid of duplicates.
cum_names = (df['screen_name'].groupby(df.index.date)
.agg(lambda x: list(x))
.cumsum()
.apply(set))
# 2016-04-01 {Bob, Sally}
# 2016-04-02 {Bob, Miguel, Tim, Sally}
# dtype: object
cum_count = cum_names.apply(len)
# 2016-04-01 2
# 2016-04-02 4
# dtype: int64
Upvotes: 9