jceg316

Reputation: 489

How do I group by time in R and plot with ggplot? Can this be done within ggplot?

I'm analysing app data using R, and I often have to group by time so I can plot the result in ggplot. However, this doesn't seem easy to do.

My data looks like:

user_id | session_id | timestamp | time_seconds
 001    |   123      | 2014-01-01|    251
 002    |   845      | 2014-01-01|    514
 003    |   741      | 2014-01-02|    141
 003    |   477      | 2014-01-03|    221
 004    |   121      | 2014-01-03|    120
 005    |   921      | 2014-01-04|    60
...

The timestamp column was converted with as.Date(), so it should be recognised as a date by R.

I need to plot line graphs showing the number of sessions over time in ggplot. Is there a simple way to do this within the ggplot code? For example:

ggplot(df, aes(timestamp,count(session_id)))+
  geom_line()

I want a count of sessions per date. The code above doesn't work; it is just an example to show what I'm after.

I'd also like to summarise by month, and to look into specific months by subsetting the data. Can this be done from that line of code? xlim isn't what I'm after, as that just "shortens" the axis.

I've tried using the aggregate function, but with mixed results; it's not really what I've been after.

Thanks.

Upvotes: 1

Views: 2892

Answers (2)

kath

Reputation: 7724

You can use group_by and summarise from the dplyr package:

library(dplyr)
library(ggplot2)

df %>%  
  group_by(timestamp) %>% 
  summarise(session_count = n()) %>% 
  ggplot(aes(timestamp, session_count)) + 
  geom_line()

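On the "within ggplot" part of the question: a minimal sketch, assuming the sample data from the question (rebuilt here with only the two relevant columns). Passing stat = "count" to geom_line asks ggplot2 to tally the rows per timestamp itself, so no separate summarising step is needed for a plain daily count.

```r
library(ggplot2)

# Sample data mirroring the question (only the columns needed here)
df <- data.frame(
  session_id = c("123", "845", "741", "477", "121", "921"),
  timestamp  = as.Date(c("2014-01-01", "2014-01-01", "2014-01-02",
                         "2014-01-03", "2014-01-03", "2014-01-04"))
)

# stat = "count" makes ggplot count the rows for each timestamp value
p <- ggplot(df, aes(timestamp)) +
  geom_line(stat = "count")
p
```

This only covers simple row counts; for anything beyond that, summarising with dplyr first, as above, is probably easier to read.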

For summarizing the data by month you can do:

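df %>%  
  # "%Y-%m" labels sort chronologically; "%b %Y" labels would sort alphabetically
  mutate(month_timestamp = format(timestamp, "%Y-%m")) %>% 
  group_by(month_timestamp) %>% 
  summarise(session_count = n()) %>% 
  # month_timestamp is a character column, so group = 1 is needed to draw a single line
  ggplot(aes(month_timestamp, session_count, group = 1)) + 
  geom_line()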

With your sample data this plot is empty, because the data contains only one month and a line needs at least two points.
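An alternative sketch that keeps the month as a real Date instead of a text label, so the axis stays in chronological order. It assumes the lubridate package is installed; df_year is hypothetical data spanning several months, not taken from the question.

```r
library(dplyr)
library(ggplot2)
library(lubridate)

# Hypothetical data: one session per day for most of 2014
df_year <- data.frame(
  timestamp = seq(as.Date("2014-01-01"), as.Date("2014-12-24"), by = "day")
)

monthly <- df_year %>%
  # floor_date() maps every date to the first day of its month (still a Date)
  mutate(month = floor_date(timestamp, "month")) %>%
  count(month, name = "session_count")

# scale_x_date() controls only how the month labels are printed
ggplot(monthly, aes(month, session_count)) +
  geom_line() +
  scale_x_date(date_labels = "%b %Y")
```

Because month is a Date, no group = 1 is needed and the x axis is continuous.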

Data

df <- structure(list(user_id = c("001", "002", "003", "003", "004", "005"), 
                     session_id = c("123", "845", "741", "477", "121", "921"), 
                     timestamp = structure(c(16071, 16071, 16072, 16073, 16073, 16074), 
                                           class = "Date"), 
                     time_seconds = c(251, 514, 141, 221, 120, 60)), 
                .Names = c("user_id", "session_id", "timestamp", "time_seconds"), 
                class = c("tbl_df", "tbl", "data.frame"), 
                row.names = c(NA, -6L))

Upvotes: 1

erocoar

Reputation: 5893

It might also be convenient to do this with lubridate; the example below sticks to base format() for the month label:

library(tidyverse)

dat <- data.frame(timestamp = rep(seq.Date(as.Date("2014/01/01"), as.Date("2014/12/24"), "day"), each = 2),
                  sessions = 1)

dat %>% 
  mutate(month = format(timestamp, "%Y-%m")) %>% 
  group_by(month) %>% 
  summarise(sum_session = sum(sessions)) %>% 
  ggplot(aes(x = month, y = sum_session, group = 1)) + geom_line()
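The question also asks about looking into specific months. A minimal sketch, assuming the same dat as above and picking March 2014 purely as an illustration: filtering the rows before summarising restricts the data itself, rather than just shortening the axis as xlim does.

```r
library(dplyr)
library(ggplot2)

# Same example data as in the answer above
dat <- data.frame(
  timestamp = rep(seq.Date(as.Date("2014-01-01"), as.Date("2014-12-24"), "day"), each = 2),
  sessions = 1
)

# Keep only March 2014 (an arbitrary example month), then sum sessions per day
march <- dat %>%
  filter(timestamp >= as.Date("2014-03-01"), timestamp < as.Date("2014-04-01")) %>%
  group_by(timestamp) %>%
  summarise(sum_session = sum(sessions))

ggplot(march, aes(timestamp, sum_session)) +
  geom_line()
```

The same filter can be placed in front of the monthly summary to compare a handful of months.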

Upvotes: 0
