Mactilda
Mactilda

Reputation: 393

Using group_by to count difference in values

I have a large df like the one below, where I want to know (using the terms in the made up df) know which id that have the same fruit for the longest period of time in this biannually event. I.e. the opportunity to hold a fruit only occurs every other year.

 df<-data.frame("id"=c(1, 1, 1, 2, 2, 2, 2, 3, 3, 3), 
  "Year"=c(1981, 1981, 1985, 2011, 2011, 2013, 2015, 1921, 1923, 1955), 
 "fruit"=c("banana", "apple", "banana", "orange", "melon", "orange", 
            "orange", "melon", "melon", "melon"))

I have tried different kinds of group_by, and cumsum see below.

df<-df %>% mutate(year_diff=cumsum(c(1, diff(df$Year)>1))) 
df %>% group_by(id, fruit) %>% filter(year_diff==2)

And the one below (after reloading the df)

df %>% group_by(id, fruit) %>% mutate(year_diff=cumsum(c(1, diff(df$Year)>1)))

And played around with:

df %>% group_by(id, fruit) %>% mutate(summarise(n_years=n_distinct(Year)))    

In the end I ideally want a tibble like the one below arranging the id's (with their fruits) in order of who have the most consecutive "holds" of a fruit in the events (over time). Remember that the event only takes place every other year.

id     fruit    occurence 
2      orange    3
3      melon     2
1      banana    1
1      apple     1
2      melon     1 
3      melon     1  

I understand that there are several steps.

EDIT: Maybe there is a way to modify this:

df %>% group_by(id, fruit) %>% summarise(n_years=n_distinct(Year)) %>% arrange(desc(n_years)) %>% ungroup()

so that it creates a new column in the original tibble (which I am unable to do, but you might be), and then I can filter consecutive events?

Upvotes: 0

Views: 86

Answers (1)

Ronak Shah
Ronak Shah

Reputation: 389265

Using dplyr we arrange rows by id, fruit and Year and create a new grouping variable (group) showing consecutive years for each id and fruit and then count the number of rows in each group.

library(dplyr)

df %>%
  arrange(id, fruit, Year) %>%
  group_by(id, fruit, group = cumsum(c(2, diff(Year)) != 2)) %>%
  summarise(n = n()) %>%
  ungroup() %>%
  select(-group)


#    id fruit      n
#  <dbl> <fct>  <int>
#1     1 apple      1
#2     1 banana     1
#3     1 banana     1
#4     2 melon      1
#5     2 orange     3
#6     3 melon      2
#7     3 melon      1

Upvotes: 2

Related Questions