Why do I get different frequencies depending on when I apply group_by() and distinct() in R?

Question

I am quite new to R and the tidyverse, and I can't wrap my head around the following:

Why do I get different frequencies depending on when I apply group_by() and distinct() to my data?

output_df_1 <- input_df %>%
  mutate(created_at = lubridate::floor_date(created_at, unit = "hours")) %>%
  select(created_at, author_id) %>%
  arrange(created_at) %>%
  distinct(author_id, .keep_all = T) %>%
  group_by(created_at) %>%
  count(created_at)

output_df_2 <- input_df %>%
  mutate(created_at = lubridate::floor_date(created_at, unit = "hours")) %>%
  select(created_at, author_id) %>%
  arrange(created_at) %>%
  group_by(created_at) %>%
  distinct(author_id, .keep_all = T) %>%
  count(created_at)

full_join(output_df_1 , output_df_2 , by = "created_at") %>%
  rename(output_df_1 = n.x,
         output_df_2 = n.y) %>%
  melt(id = "created_at") %>%
  ggplot()+
  geom_line(aes(x=created_at, y=value, colour=variable),
            linetype = "solid",
            size = 0.75) +
  scale_colour_manual(values=c("#005293","#E37222"))

Context

input_df is a data frame containing observations of tweets with timestamps and author_ids. I would like to produce a plot with variable1 being tweets per hour (this poses no problem) and variable2 being distinct users per hour. I am not sure which of the two lines in the above plot correctly visualizes the distinct users per hour.

TarJae · Accepted Answer

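The difference comes from where distinct() sits relative to group_by(). In the first pipeline, distinct() runs before group_by(), on ungrouped data, so each author_id is kept only once across the whole data set. In the second pipeline, distinct() runs after group_by(created_at), so it works within each group and keeps each author_id once per hour.
Moreover, the explicit group_by() before count() is not needed. count() groups by the given column itself: count(cyl) is the same as group_by(cyl) %>% summarise(n = n()).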
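
As a quick check of that equivalence, here is a minimal sketch on the built-in mtcars data. The object names via_count and via_summarise are only illustrative, and only the values are compared, since one result is a plain data frame and the other a tibble.

```r
library(dplyr)

# count() groups by the given column, tallies the rows, and ungroups again
via_count <- mtcars %>%
  count(cyl)

# The long-hand equivalent: explicit grouping followed by summarise()
via_summarise <- mtcars %>%
  group_by(cyl) %>%
  summarise(n = n())

# Compare the columns rather than the whole objects, because the
# classes differ (data.frame vs. tibble)
all.equal(via_count$cyl, via_summarise$cyl)
all.equal(via_count$n, via_summarise$n)
```

Both comparisons should come back TRUE, which is why adding group_by(created_at) directly before count(created_at) does not change the counts.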

Here is an example:

mtcars %>% 
  distinct(am, .keep_all=TRUE) %>%
  count(cyl)

mtcars %>% 
  distinct(am, .keep_all=TRUE) %>% 
  count(cyl)

gives:

> mtcars %>% 
+   distinct(am, .keep_all=TRUE) %>%
+   count(cyl)
  cyl n
1   6 2
> mtcars %>% 
+   distinct(am, .keep_all=TRUE) %>% 
+   count(cyl)
  cyl n
1   6 2

If you change the order of distinct:

mtcars %>% 
  distinct(am, .keep_all=TRUE) %>% 
  count(cyl)

mtcars %>% 
  count(cyl) %>% 
  distinct(am, .keep_all=TRUE)

you get:

> mtcars %>% 
+   distinct(am, .keep_all=TRUE) %>% 
+   count(cyl)
  cyl n
1   6 2
> 
> mtcars %>% 
+   count(cyl) %>% 
+   distinct(am, .keep_all=TRUE)
Error: `distinct()` must use existing variables.
x `am` not found in `.data`.

In your example, once the explicit group_by() is removed, the two pipelines are identical and therefore give the same result for output_df_1 and output_df_2:

output_df_1 <- input_df %>%
  mutate(created_at = lubridate::floor_date(created_at, unit = "hours")) %>%
  select(created_at, author_id) %>%
  arrange(created_at) %>%
  distinct(author_id, .keep_all = T) %>%
  count(created_at)



output_df_2 <- input_df %>%
  mutate(created_at = lubridate::floor_date(created_at, unit = "hours")) %>%
  select(created_at, author_id) %>%
  arrange(created_at) %>%
  distinct(author_id, .keep_all = T) %>%
  count(created_at)
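
To see what each placement of distinct() actually counts, here is a small sketch on made-up data. The tibble toy and the names first_seen, per_hour and direct are hypothetical and not part of the original question; the n_distinct() variant is one possible way to get distinct users per hour, not the only one.

```r
library(dplyr)

# Hypothetical toy data: author "a" tweets twice at 10:00 and once at 11:00
toy <- tibble(
  created_at = as.POSIXct(
    c("2021-01-01 10:00:00", "2021-01-01 10:00:00", "2021-01-01 10:00:00",
      "2021-01-01 11:00:00", "2021-01-01 11:00:00"),
    tz = "UTC"
  ),
  author_id = c("a", "a", "b", "a", "c")
)

# distinct() on ungrouped data: each author is kept once overall,
# so an author is only counted in the first hour they appear
first_seen <- toy %>%
  distinct(author_id, .keep_all = TRUE) %>%
  count(created_at)

# distinct() on grouped data: each author is kept once per hour
per_hour <- toy %>%
  group_by(created_at) %>%
  distinct(author_id, .keep_all = TRUE) %>%
  count(created_at)

# A more direct way to express "distinct users per hour"
direct <- toy %>%
  group_by(created_at) %>%
  summarise(n = n_distinct(author_id))
```

In this toy case first_seen gives 2 and 1, because author "a" is no longer counted at 11:00, while per_hour and direct both give 2 and 2. So if the goal is distinct users per hour, distinct() has to run within the hourly groups (or be replaced by n_distinct()); running it before grouping counts first-time authors per hour instead.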
