Reputation: 820
I have a dataframe containing tweets. It has a datetime column, a unique user_id column, and one column per thematic category indicating whether the tweet belongs to that category. In the end I'd like to visualize it as a line graph.
The data looks as follows:
               datetime             user_id  Meta  News & Media  Environment  ...
0   2019-05-08 07:16:02            21741359   NaN           NaN          1.0
1   2019-05-08 07:15:23          2785265103   NaN           NaN          1.0
2   2019-05-08 07:14:11           606785697   NaN           1.0          NaN
3   2019-05-08 07:13:42  718989200616529921   1.0           NaN          NaN
4   2019-05-08 07:13:27  939207240728350720   1.0           NaN          1.0
...                 ...                 ...   ...           ...          ...
So far I've managed to produce one by summing each theme per day with the following code:
monthly_trends = tweets_df.groupby(pd.Grouper(key='datetime', freq='D'))[list(issues.keys())].sum().fillna(0)
which gives me:
             Meta  News & Media  Environment  ...
datetime
2019-05-07  586.0          25.0         30.0
2019-05-08  505.0          16.0         70.0
2019-05-09  450.0          12.0         50.0
2019-05-10  339.0           8.0         90.0
2019-05-11  254.0           5.0         10.0
I plot this with:
monthly_trends.plot(kind='line', figsize=(20,10), linewidth=5, fontsize=20)
plt.xlabel('Date', fontsize=20)
plt.title('Issue activity during the election period', size = 30)
plt.show()
This gives me a nice graph. But since a single user may just be spamming one theme, I'd like to count the number of unique users per theme per day instead. I've tried adding further groupby calls but only got errors.
Upvotes: 0
Views: 436
Reputation: 107687
For pandas' DataFrame.plot across multiple series, you need data in wide format with separate columns. However, for the unique user_id calculation you need data in long format for the aggregation. Therefore, consider melt, groupby, then pivot back for plotting. Had you not needed a
### RESHAPE LONG AND AGGREGATE
long_df = (tweets_df.melt(id_vars=['datetime', 'user_id'],
                          value_name='Count', var_name='Issue')
                    .query("Count >= 1")
                    .groupby([pd.Grouper(key='datetime', freq='D'), 'Issue'])['user_id'].nunique()
                    .reset_index()
          )
### RESHAPE WIDE AND PLOT
(long_df.pivot(index='datetime', columns='Issue', values='user_id')
        .plot(kind='line', title='Unique Users by Day and Tweet Issue')
)
plt.show()
plt.clf()
plt.close()
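As a quick sanity check, here is a minimal sketch that runs the same melt and nunique steps on the five sample rows from the question. The sixth row is a made-up "spam" tweet added only for illustration, and the names issue_cols, day, unique_users and tweet_counts are hypothetical. It uses unstack instead of reset_index plus pivot, which should give the same wide table.

```python
import pandas as pd

nan = float("nan")

# The five sample rows from the question, plus one hypothetical extra row in
# which user 21741359 tweets about Environment a second time on the same day.
tweets_df = pd.DataFrame({
    "datetime": pd.to_datetime([
        "2019-05-08 07:16:02", "2019-05-08 07:15:23", "2019-05-08 07:14:11",
        "2019-05-08 07:13:42", "2019-05-08 07:13:27", "2019-05-08 07:20:00",
    ]),
    "user_id": [21741359, 2785265103, 606785697,
                718989200616529921, 939207240728350720, 21741359],
    "Meta": [nan, nan, nan, 1.0, 1.0, nan],
    "News & Media": [nan, nan, 1.0, nan, nan, nan],
    "Environment": [1.0, 1.0, nan, nan, 1.0, 1.0],
})
issue_cols = ["Meta", "News & Media", "Environment"]
day = pd.Grouper(key="datetime", freq="D")

# Long format: one row per (tweet, issue) pair, keeping only flagged pairs.
long_df = (tweets_df.melt(id_vars=["datetime", "user_id"], value_vars=issue_cols,
                          var_name="Issue", value_name="Count")
                    .query("Count >= 1"))

# Unique users per day and issue, reshaped wide (one column per issue).
unique_users = long_df.groupby([day, "Issue"])["user_id"].nunique().unstack("Issue")

# Plain tweet counts per day and issue, for comparison.
tweet_counts = tweets_df.groupby(day)[issue_cols].sum()

print(unique_users)
print(tweet_counts)
```

On this sample, Environment has four flagged tweets but only three distinct users, so the repeated user is counted once in unique_users.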
Upvotes: 2
Reputation: 6642
Stack all issues, group by issue and day, and count the unique user ids:
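# Name the column axis so the stacked level is labelled 'issue'
df.columns.names = ['issue']
df_users = (df.set_index(['datetime', 'user_id'])[list(issues)]  # issues: the dict of themes from the question
              .stack()
              .dropna()  # keep only (tweet, issue) pairs that are actually flagged
              .reset_index()
              .groupby([pd.Grouper(key='datetime', freq='D'), 'issue'])['user_id']
              .nunique()
              .rename('n_unique_users')
              .reset_index())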
print(df_users)
    datetime         issue  n_unique_users
0 2019-05-08   Environment               3
1 2019-05-08          Meta               2
2 2019-05-08  News & Media               1
Then you can reshape as required for plotting:
df_users.pivot_table(index='datetime', columns='issue', values='n_unique_users', aggfunc='sum')
issue       Environment  Meta  News & Media
datetime
2019-05-08            3     2             1
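One thing to watch when reshaping for a line plot: a day on which an issue has no users at all produces no row in the long table, so the pivoted cell comes out as NaN and the line would have a gap there. A minimal sketch, using made-up counts shaped like df_users and a hypothetical name wide, that fills those cells with 0 through the fill_value parameter of pivot_table:

```python
import pandas as pd

# Hypothetical long-format counts shaped like df_users; "Meta" has no row
# for 2019-05-09 because nobody tweeted about it that day.
df_users = pd.DataFrame({
    "datetime": pd.to_datetime(["2019-05-08", "2019-05-08", "2019-05-09"]),
    "issue": ["Environment", "Meta", "Environment"],
    "n_unique_users": [3, 2, 1],
})

# fill_value=0 replaces the missing (day, issue) combinations with 0.
wide = df_users.pivot_table(index="datetime", columns="issue",
                            values="n_unique_users", aggfunc="sum",
                            fill_value=0)
print(wide)
```

Calling wide.plot(kind='line') on the result should then draw every issue across the full date range, with zeros where no user was active.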
Upvotes: 1