Counting the number of occurrences of a combination of values in R

I'm working with data on cases that go through a process consisting of several phases over a period of time. Each case has a unique id number. A process can start in any of several phases and ends in the phase "Finished" (except for processes that are still ongoing). A case can go through the process multiple times. The data looks similar to this:

library(dplyr)
df1 <- structure(list(id = c("1", "1", "2", "2", "2", "2", "3", "3", 
"3", "3", "3", "3", "3", "3", "3", "3"), time = structure(c(17453, 
17458, 17453, 17462, 17727, 17735, 17453, 17484, 17568, 17665, 
17665, 17709, 17727, 17727, 17757, 17819), class = "Date"), old_fase = 
c(NA, "Fase 1", NA, "Fase 1", "Finished", "Fase 1", NA, "Fase 1", "Fase 2A", 
"Finished", "Fase 2A", "Fase 2B", "Finished", "Fase 2B", "Fase 1", 
"Fase 2A"), new_fase = c("Fase 1", "Finished", "Fase 1", "Finished", 
"Fase 1", "Finished", "Fase 1", "Fase 2A", "Finished", "Fase 2A", 
"Fase 2B", "Finished", "Fase 2B", "Fase 1", "Fase 2A", "Fase 2B"
)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -16L))

For my analysis I want to create a new id based on the occurrence of each process per id. Using group_by and mutate on "id" and "new_fase" gives the following incorrect result. It goes wrong because of the first occurrence of "Fase 2B" in row 11, which is numbered 1 even though it belongs to the second process of id 3.

df1 %>% 
group_by(id,new_fase) %>% 
mutate(occurrence=row_number())

The correct solution should look like this:

df1 %>% 
mutate(occurrence = c(rep(1, 4), 2, 2, rep(1, 3), rep(2, 3), rep(3, 4)))

I tried multiple approaches and read multiple Stack Overflow posts, but I have not been able to figure it out. Any help is appreciated, preferably using a tidyverse solution.

Upvotes: 3

Views: 812

Answers (3)

I found this temporary solution (thanks to iod's solution on the first example using group_by and mutate).

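library(tidyr) # fill() comes from tidyr

df1 %>% 
filter(is.na(old_fase) | old_fase == "Finished") %>% # rows that mark the beginning of a new process
group_by(id) %>% 
mutate(occurrence = row_number()) %>% # number the process starts per id
select(id, time, occurrence) %>% 
left_join(df1, ., by = c("id", "time")) %>% 
fill(occurrence) # carry the process number down to the rows that follow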
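
A more compact variant of the same idea, offered as a sketch rather than a definitive answer: instead of filtering and joining back, count the process starts cumulatively within each id. The names df1_small and result are introduced here for illustration only, and the sketch assumes the rows are already in chronological order within each id.

```r
library(dplyr)

# Same transitions as df1 in the question, rebuilt compactly
# (hypothetical stand-in; the time column is omitted for brevity).
df1_small <- data.frame(
  id = c("1", "1", "2", "2", "2", "2", rep("3", 10)),
  old_fase = c(NA, "Fase 1",
               NA, "Fase 1", "Finished", "Fase 1",
               NA, "Fase 1", "Fase 2A", "Finished", "Fase 2A",
               "Fase 2B", "Finished", "Fase 2B", "Fase 1", "Fase 2A"),
  stringsAsFactors = FALSE
)

result <- df1_small %>%
  group_by(id) %>%
  # A row starts a new process when there is no previous phase
  # or the previous phase was "Finished"; cumsum() counts those starts.
  mutate(occurrence = cumsum(is.na(old_fase) | old_fase == "Finished")) %>%
  ungroup()
```

Because the counter only increases on a start row, every later row of the same process inherits its number without a join or fill step.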

Upvotes: 0

akrun

Reputation: 886978

We can use ave from base R

df2$occurrence <- with(df2, ave(seq_along(id), id, fase, FUN = seq_along))

Or with data.table

library(data.table)
setDT(df2)[, occurrence := seq_len(.N), .(id, fase)]
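
For the data as currently shown in the question (with old_fase and new_fase columns), the ave idea might be adapted to count process starts instead of rows. This is a minimal base R sketch under that assumption; df_base is a hypothetical stand-in for df1 with only the columns needed, and the rows are assumed to be in chronological order within each id.

```r
# Hypothetical stand-in for the question's df1 (id and old_fase only).
df_base <- data.frame(
  id = c("1", "1", "2", "2", "2", "2", rep("3", 10)),
  old_fase = c(NA, "Fase 1",
               NA, "Fase 1", "Finished", "Fase 1",
               NA, "Fase 1", "Fase 2A", "Finished", "Fase 2A",
               "Fase 2B", "Finished", "Fase 2B", "Fase 1", "Fase 2A"),
  stringsAsFactors = FALSE
)

# TRUE where a new process begins: no previous phase, or previous phase "Finished".
starts <- is.na(df_base$old_fase) | df_base$old_fase == "Finished"

# Running count of starts within each id gives the process number.
df_base$occurrence <- ave(as.integer(starts), df_base$id, FUN = cumsum)
```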

Upvotes: 3

iod

Reputation: 7592

df3<- df1 %>% 
  group_by(id,fase) %>% 
  mutate(occurrence=row_number())

df3
# A tibble: 18 x 4
# Groups:   id, fase [9]
      id fase  time       occurrence
   <dbl> <chr> <date>          <int>
 1     1 a     2018-01-01          1
 2     1 b     2018-01-02          1
 3     1 c     2018-01-03          1
 4     2 a     2018-01-01          1
 5     2 b     2018-01-02          1
 6     2 c     2018-01-03          1
 7     2 a     2018-01-04          2
 8     2 b     2018-01-05          2
 9     2 c     2018-01-06          2
10     2 a     2018-01-07          3
11     2 b     2018-01-08          3
12     2 c     2018-01-09          3
13     3 a     2018-01-01          1
14     3 b     2018-01-02          1
15     3 c     2018-01-03          1
16     3 a     2018-01-04          2
17     3 b     2018-01-05          2
18     3 c     2018-01-06          2

all(df2==df3)
[1] TRUE

This groups the data frame into parts that share the same id and phase, and then simply numbers the rows within each part. Note that this assumes the data frame is already sorted chronologically, as in your sample data. If it is not, sort it by time first.
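
If the rows are not in chronological order, sorting before grouping might look like the sketch below. The data frame unsorted and its values are made up for illustration; only the column names id, fase and time follow the example above.

```r
library(dplyr)

# Hypothetical unsorted data: id 1 passes through phase "a" twice.
unsorted <- data.frame(
  id = c(1, 1, 1),
  fase = c("a", "b", "a"),
  time = as.Date(c("2018-01-04", "2018-01-02", "2018-01-01")),
  stringsAsFactors = FALSE
)

numbered <- unsorted %>%
  arrange(id, time) %>%              # put each id's rows in chronological order
  group_by(id, fase) %>%
  mutate(occurrence = row_number()) %>%  # number repeats of the same phase per id
  ungroup()
```

Without the arrange step, the 2018-01-04 row would be numbered as the first occurrence of "a" simply because it appears first in the input.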

Upvotes: 2
