Reputation: 57
I need to make a line plot where x = hour of day and y = the (normalized) number of tweets in that hour, counting only tweets from a given month. Each line represents one month.
My data frame is in this format (I have more columns, but they aren't relevant here):
id_tweet day month hour minute id_user
550654742654103552 01 01 12 08 174744462
550654753106296832 01 01 12 08 15355832
550654818935910400 01 01 12 08 628822209
550654823667089409 01 01 12 08 283218297
550654824308813824 01 01 12 09 58315346
I want to know what percentage of people tweet in January, or July, and so on.
The problem is that my data is heavily biased: the collection algorithm changed, so I have far more data for months 1 to 4 than for the rest. The distribution of my data is shown in the image below:
Long story short, for January I need to count the tweets posted at each hour of the day and divide each count by the total number of tweets from January. That would be line 1 of the graph.
Line 2 would be the same for February: the number of tweets at each hour of the day divided by the total number of tweets from February, and so on.
I hope that was clear. Thanks in advance for any help.
Upvotes: 1
Views: 150
Reputation: 145965
You can use dplyr to aggregate your data:
library(dplyr)

# Count tweets per month and hour, then divide by each month's total
agg_data <- your_data %>%
  group_by(month, hour) %>%
  summarize(n_hour = n()) %>%
  group_by(month) %>%
  mutate(percent_of_month = n_hour / sum(n_hour))
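To sanity-check the normalization, here is a minimal sketch on toy data. The rows in `toy` are made up for illustration and are not from the question. It uses `count()` as a shorthand for grouping by month and hour and counting rows.

```r
library(dplyr)

# Toy data, invented for illustration (not the asker's real tweets).
toy <- data.frame(
  month = c("01", "01", "01", "01", "02", "02"),
  hour  = c("12", "12", "13", "14", "12", "13")
)

agg_toy <- toy %>%
  count(month, hour, name = "n_hour") %>%            # tweets per month and hour
  group_by(month) %>%
  mutate(percent_of_month = n_hour / sum(n_hour)) %>% # share of that month's tweets
  ungroup()

# Each month's shares should add up to 1, however many tweets it has.
agg_toy %>%
  group_by(month) %>%
  summarize(total = sum(percent_of_month))
```

Because each month is divided by its own total, a month with far more collected tweets no longer dominates the comparison.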
I'll leave the plotting to you.
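For reference, one possible way to draw the plot is sketched below with ggplot2. It assumes an aggregated data frame with the columns `month`, `hour` and `percent_of_month`. The small `agg_data` built here is a hypothetical stand-in with made-up values, not real results.

```r
library(ggplot2)

# Hypothetical aggregated data: one row per month and hour,
# holding that hour's share of the month's tweets (made-up values).
agg_data <- data.frame(
  month = rep(c("01", "02"), each = 3),
  hour = rep(c(12L, 13L, 14L), times = 2),
  percent_of_month = c(0.5, 0.25, 0.25, 0.2, 0.3, 0.5)
)

# One line per month: hour on x, normalized tweet count on y.
p <- ggplot(agg_data, aes(x = hour, y = percent_of_month,
                          colour = factor(month), group = month)) +
  geom_line() +
  labs(x = "Hour of day", y = "Share of the month's tweets", colour = "Month")

print(p)
```

If `hour` is stored as text (for example "08"), converting it with `as.integer()` first should keep the x axis numeric and in order.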
Upvotes: 1