Yuval Spiegler
Yuval Spiegler

Reputation: 373

How to flag first change in a variable value between years, per group?

Given a very large longitudinal dataset with different groups, I need to create a flag that indicates the first change in a certain variable (code) between years (year), per group (id). The type of observation within the same id-year just indicates different group members.

Sample data:

library(tidyverse)    
sample <- tibble(id = rep(1:3, each=6),
                     year = rep(2010:2012, 3, each=2),
                     type = (rep(1:2, 9)),
                     code = c("abc","abc","","","xyz","xyz", "","","lmn","","efg","efg","def","def","","klm","nop","nop"))

What I need is to flag the first change to code within a group, between years. Second changes do not matter. Missing codes ("") can be treated as NA but in any case should not affect flag. The following is the above tibble with a flag field as it should be:

# A tibble: 18 × 5
      id  year  type  code  flag
   <int> <int> <int> <chr> <dbl>
1      1  2010     1   abc     0
2      1  2010     2   abc     0
3      1  2011     1           0
4      1  2011     2           0
5      1  2012     1   xyz     1
6      1  2012     2   xyz     1
7      2  2010     1           0
8      2  2010     2           0
9      2  2011     1   lmn     0
10     2  2011     2           0
11     2  2012     1   efg     1
12     2  2012     2   efg     1
13     3  2010     1   def     0
14     3  2010     2   def     0
15     3  2011     1           1
16     3  2011     2   klm     1
17     3  2012     1   nop     1
18     3  2012     2   nop     1

I still have a looping mindset and I am trying to use vectorized dplyr to do what I need. Any input would be greatly appreciated!

EDIT: thanks for pointing this out regarding the importance of year. The id's are arranged by year, as the ordering is important here, and also all types per id per year need to have the same flag. So, in the edited row 15 e code is "" which would not warrant a change by itself, but since in the same year row 16 has a new code, both observations need to have their codes changed to 1.

Upvotes: 4

Views: 1131

Answers (3)

Jaap
Jaap

Reputation: 83225

A short solution with the data.table-package:

library(data.table)
setDT(samp)[, flag := 0][code!="", flag := 1*(rleid(code)-1 > 0), by = id]

Or:

setDT(samp)[, flag := 0][code!="", flag := 1*(code!=code[1] & code!=''), by = id][]

which gives the desired result:

> samp
    id year type code flag
 1:  1 2010    1  abc    0
 2:  1 2010    2  abc    0
 3:  1 2011    1         0
 4:  1 2011    2         0
 5:  1 2012    1  xyz    1
 6:  1 2012    2  xyz    1
 7:  2 2010    1         0
 8:  2 2010    2         0
 9:  2 2011    1  lmn    0
10:  2 2011    2         0
11:  2 2012    1  efg    1
12:  2 2012    2  efg    1
13:  3 2010    1  def    0
14:  3 2010    2  def    0
15:  3 2011    1  klm    1
16:  3 2011    2  klm    1
17:  3 2012    1  nop    1
18:  3 2012    2  nop    1

Or when the year is relevant as well:

setDT(samp)[, flag := 0][code!="", flag := 1*(rleid(code, year)-1 > 0), id]

A possible base R alternative:

f <- function(x) {
  x <- rle(x)$lengths
  1 * (rep(seq_along(x), times=x) - 1 > 0)
}

samp$flag <- 0
samp$flag[samp$code!=''] <- with(samp[samp$code!=''], ave(as.character(code), id, FUN = f))

NOTE: it is better not to give your object the same name as functions.

Used data:

samp <- data.frame(id = rep(1:3, each=6),
                   year = rep(2010:2012, 3, each=2),
                   type = (rep(1:2, 9)),
                   code = c("abc","abc","","","xyz","xyz", "","","lmn","","efg","efg","def","def","klm","klm","nop","nop"))

Upvotes: 2

akrun
akrun

Reputation: 887168

We can use data.table

library(data.table)
setDT(sample)[, flag :=0][code!="",  flag := {rl <- rleid(code)-1; cummax(rl*(rl < 2)) }, id]
sample
#    id year type code flag
# 1:  1 2010    1  abc    0
# 2:  1 2010    2  abc    0
# 3:  1 2011    1         0
# 4:  1 2011    2         0
# 5:  1 2012    1  xyz    1
# 6:  1 2012    2  xyz    1
# 7:  2 2010    1         0
# 8:  2 2010    2         0
# 9:  2 2011    1  lmn    0
#10:  2 2011    2         0
#11:  2 2012    1  efg    1
#12:  2 2012    2  efg    1
#13:  3 2010    1  def    0
#14:  3 2010    2  def    0
#15:  3 2011    1  klm    1
#16:  3 2011    2  klm    1
#17:  3 2012    1  nop    1
#18:  3 2012    2  nop    1

Update

If we need to include the 'year' as well,

setDT(sample)[, flag :=0][code!="",  flag := {rl <- rleid(code, year)-1
                   cummax(rl*(rl < 2)) }, id]

Upvotes: 3

Adam Spannbauer
Adam Spannbauer

Reputation: 2757

possible solution using the dplyr. not sure its the cleanest way though

sample %>% 
  group_by(id) %>% 
  #find first year per group where code exists
  mutate(first_year = min(year[code != ""])) %>% 
  #gather all codes from first year (does not assume code is constant within year)
  mutate(first_codes = list(code[year==first_year])) %>% 
  #if year is not first year & code not in first year codes & code not blank
  mutate(flag = as.numeric(year != first_year & !(code %in% unlist(first_codes)) & code != "")) %>% 
  #drop created columns
  select(-first_year, -first_codes) %>% 
  ungroup()

output

# A tibble: 18 × 5
      id  year  type  code  flag
   <int> <int> <int> <chr> <dbl>
1      1  2010     1   abc     0
2      1  2010     2   abc     0
3      1  2011     1           0
4      1  2011     2           0
5      1  2012     1   xyz     1
6      1  2012     2   xyz     1
7      2  2010     1           0
8      2  2010     2           0
9      2  2011     1   lmn     0
10     2  2011     2           0
11     2  2012     1   efg     1
12     2  2012     2   efg     1
13     3  2010     1   def     0
14     3  2010     2   def     0
15     3  2011     1   klm     1
16     3  2011     2   klm     1
17     3  2012     1   nop     1
18     3  2012     2   nop     1

Upvotes: 2

Related Questions