Grouping by date and number of unique users for multiple variables

Question

I have a dataframe containing tweets. I've got columns with information about the datetime, about a unique user_id and then columns indicating if the tweet belongs to a thematic category. In the end I'd like to visualize it with a line graph.

The data looks as follows:

                 datetime             user_id  Meta  News & Media  Environment  ...
0     2019-05-08 07:16:02            21741359   NaN           NaN          1.0
1     2019-05-08 07:15:23          2785265103   NaN           NaN          1.0
2     2019-05-08 07:14:11           606785697   NaN           1.0          NaN
3     2019-05-08 07:13:42  718989200616529921   1.0           NaN          NaN
4     2019-05-08 07:13:27  939207240728350720   1.0           NaN          1.0
...                   ...                 ...   ...           ...          ...

So far I've managed to produce one just summing each theme per day with the following code:

monthly_trends = tweets_df.groupby(pd.Grouper(key='datetime', freq='D'))[list(issues.keys())].sum().fillna(0)

which gives me:

             Meta  News & Media   Environment  ...
datetime                                                                
2019-05-07  586.0          25.0          30.0      
2019-05-08  505.0          16.0          70.0      
2019-05-09  450.0          12.0          50.0     
2019-05-10  339.0           8.0          90.0               
2019-05-11  254.0           5.0          10.0

I plot this with:

monthly_trends.plot(kind='line', figsize=(20,10), linewidth=5, fontsize=20)
plt.xlabel('Date', fontsize=20)
plt.title('Issue activity during the election period', size = 30)
plt.show()

Which gives me a nice graph. But since one user may just be spamming one theme, I'd like to get a count of the frequency of unique users per theme per day. I've tried using additional groupby's but only got errors.

mcsoini · Accepted Answer

Stack all issues, group by issue and day, and count the unique user ids:

df.columns.names = ['issue']
df_users = (df.set_index(['datetime', 'user_id'])[issues]
              .stack()
              .reset_index().groupby([pd.Grouper(key='datetime', freq='D'), 'issue'])
              .apply(lambda x: len(x.user_id.unique()))
              .rename('n_unique_users').reset_index())
print(df_users)

    datetime         issue  n_unique_users
0 2019-05-08   Environment               3
1 2019-05-08          Meta               2
2 2019-05-08  News & Media               1

Then you can reshape as required for plotting:

df_users.pivot_table(index='datetime', columns='issue', values='n_unique_users', aggfunc=sum)

issue       Environment  Meta  News & Media
datetime                                   
2019-05-08            3     2             1

Grouping by date and number of unique users for multiple variables

Answers (2)

Related Questions