feyr
feyr

Reputation: 25

Conditional summarize of groups in dplyr based on date

I'm an R noob, and trying to perform a summary on a dataset which totals the number of event types for each ID that occurred between events of type 'B' for that ID. Here is a sample to illustrate:

id <- c('1', '1', '1', '2', '2', '2', '3', '3', '3', '3')
type <- c('A', 'A', 'B', 'A', 'B', 'C', 'A', 'B', 'C', 'B')
datestamp <- as.Date(c('2016-06-20','2016-07-16','2016-08-14','2016-07-17'
                       ,'2016-07-18','2016-07-19','2016-07-16','2016-07-19'
                       , '2016-07-21','2016-08-20'))
df <- data.frame(id, type, datestamp)

which produces:

> df
   id type  datestamp
1   1    A 2016-06-20
2   1    A 2016-07-16
3   1    B 2016-08-14
4   2    A 2016-07-17
5   2    B 2016-07-18
6   2    C 2016-07-19
7   3    A 2016-07-16
8   3    B 2016-07-19
9   3    C 2016-07-21
10  3    B 2016-08-20

Any time an event 'B' occurs, I want to know the number of each event type that occurred before that B event, but after any other B events for that ID. What I would like to end up with is a table like this:

  id type B_instance count
1  1    A          1     2
2  2    A          1     1
3  3    A          1     1
4  3    C          2     1

In researching, this question came the closest: summarizing a field based on the value of another field in dplyr

I've been trying to make this work:

  df2 <- df %>%
  group_by(id, type) %>%
  summarize(count = count(id[which(datestamp < datestamp[type =='B'])])) %>%
  filter(type != 'B')

But it errors out (also, even if it worked, it doesn't account for 2 'B' events in the same ID, like with id=3)

Upvotes: 1

Views: 1349

Answers (2)

akrun
akrun

Reputation: 887881

Here is an option using data.table. We convert the 'data.frame' to 'data.table' (setDT(df), grouped by 'id', we get the sequence of the max position where 'type' is 'B', find the row index (.I), extract that column ($V1). Then, we subset the dataset (df[i1]), remove the rows where 'type' is 'B", grouped by 'id', 'type' and the rleid of 'type', we get the number of rows as 'count'.

library(data.table)
i1 <- setDT(df)[, .I[seq(max(which(type=="B")))] , by = id]$V1
df[i1][type!="B"][,  .(count = .N), .(id, type, B_instance = rleid(type))]
#   id type B_instance count
#1:  1    A        1     2
#2:  2    A        1     1
#3:  3    A        1     1
#4:  3    C        2     1

Upvotes: 1

akuiper
akuiper

Reputation: 215117

You can use cumsum to create the new group variable B_instance by doing cumsum(type == "B") and then filter out types that fall behind the last B as well as type B itself, as they will not be counted. Then use count to count the occurrence with the group by id, B_instance and type.

df %>% 
       group_by(id) %>% 
       # create B_instance using cumsum on the type == "B" condition
       mutate(B_instance = cumsum(type == "B") + 1) %>%    
       # filter out rows with type behind the last B and all B types                 
       filter(B_instance < max(B_instance), type != "B") %>% 
       # count the occurrences of type grouped by id and B_instance
       count(id, type, B_instance) 

# Source: local data frame [4 x 4]
# Groups: id, type [?]

#       id   type B_instance     n
#   <fctr> <fctr>      <dbl> <int>
# 1      1      A          1     2
# 2      2      A          1     1
# 3      3      A          1     1
# 4      3      C          2     1

Upvotes: 0

Related Questions