gecko
gecko

Reputation: 153

Check if conditions are met in any previous row in the group R

Let's say I have this data frame in R:

x <-data.frame(group=c("A","A","A", "B","B", "C","C","C"), cond=c("g1", "g2", "g1", "g3", "g2",  "g2", "g1", "g2"), order=c(1,2,3,2,1,2,3,1))
  group cond order
     A   g1     1
     A   g2     2
     A   g1     3
     B   g3     2
     B   g2     1
     C   g2     2
     C   g1     3
     C   g2     1

I want order groups according to the order variable to make two variables: v1) yes/no whether cond=="g1" for any of the previous rows in a group, and v2) yes/no whether, for a given row, there is a previous row in the group with a different value for cond (v2=no when group=C and order=1 or 2 but v2=yes when group=C and order=3). I know I can use dplyr to arrange the groups:

x %>% group_by(group) %>% arrange(order, .by_group=T)

but I don't know where to go from there since lag() and lead() are only for the row immediately before or after.

Upvotes: 3

Views: 2255

Answers (2)

gecko
gecko

Reputation: 153

Here is a method that works but is definitely not the most efficient:

x <-data.frame(group=c("A","A","A", "B","B", "C","C","C"), 
               cond=c("g1", "g2", "g1", "g3", "g2",  "g2", "g1", "g2"), 
               order=c(1,2,3,2,1,2,3,1))

x<-x %>% group_by(group) %>% 
  mutate(cond_1=lag(cond),
         cond_2=ifelse(order>2, lag(cond_1), NA),
         cond_2=ifelse(order>3, lag(cond_1), NA)
         ) %>% ungroup()

#doublecheck there are enough lag variables
max_order<-max(x$order, na.rm = T)
lag_var<-sum(grepl("cond_", names(x)))
if (lag_var<(max_order-1)){
  stop("Add more lag variables (cond_#) to make sure you get all previous conditions")
} else { #do nothing
}

x <- x %>% 
  mutate(v1=case_when(order !=1 ~ as.numeric(rowSums(
    sapply(x %>% select(., starts_with("cond_")), 
           function(x) grepl("^g1$", x, ignore.case = TRUE)  ))>0)),
    v2=as.numeric(apply(x %>% select(., starts_with("cond")), 1, 
function(x) length(unique(na.omit(x)))>1)))

Upvotes: 0

zephryl
zephryl

Reputation: 17069

  1. Generate one-hot codes for cond. (I use pivot_wider for this.)
  2. When the cumulative sum for a cond code > 0, it means that value has occurred in the current or a previous row. We can exclude the current row by using dplyr::lag.
  3. Summing across the one-hot codes tells us how many unique cond values have been seen so far. If this is >1, then we know a previous row must have had a different value from the current row.
  4. Clean up by selecting away the one-hot codes and other helper columns.
library(dplyr)
library(tidyr)

x <-data.frame(
  group=c("A","A","A", "B","B", "C","C","C"), 
  cond=c("g1", "g2", "g1", "g3", "g2",  "g2", "g1", "g2"), 
  order=c(1,2,3,2,1,2,3,1)
)

x2 <- x %>% 
  mutate(cond_pivot = cond, cond_value = 1) %>% 
  pivot_wider(
    names_from = cond_pivot, 
    values_from = cond_value, 
    values_fill = 0
  ) %>% 
  group_by(group) %>% 
  arrange(order, .by_group = TRUE) %>% 
  mutate(
    across(g1:g3, ~ cumsum(.x) > 0, .names = "{.col}_seen_yet"),
    n_seen_so_far = rowSums(across(g1_seen_yet:g3_seen_yet)),
    v1 = dplyr::lag(g1_seen_yet, default = FALSE),
    v2 = n_seen_so_far > 1
  ) %>% 
  select(group:order, v1:v2)

Output:

# A tibble: 8 x 5
# Groups:   group [3]
  group cond  order v1    v2   
  <chr> <chr> <dbl> <lgl> <lgl>
1 A     g1        1 FALSE FALSE
2 A     g2        2 TRUE  TRUE 
3 A     g1        3 TRUE  TRUE 
4 B     g2        1 FALSE FALSE
5 B     g3        2 FALSE TRUE 
6 C     g2        1 FALSE FALSE
7 C     g2        2 FALSE FALSE
8 C     g1        3 FALSE TRUE 

Upvotes: 2

Related Questions