aiorr
aiorr

Reputation: 599

Identify rows with a value greater than threshold, but only direct one above per group

Suppose we have a dataset with a grouping variable, a value, and a threshold that is unique per group. Say I want to identify a value that is greater than a threshold, but only one.

test <- data.frame(
  grp = c("A", "A", "A", "B", "B", "B"),
  value = c(1, 3, 5, 1, 3, 5),
  threshold = c(4,4,4,2,2,2)
)

want <- data.frame(
  grp = c("A", "A", "A", "B", "B", "B"),
  value = c(1, 3, 5, 1, 3, 5),
  threshold = c(4,4,4,2,2,2),
  want = c(NA, NA, "yes", NA, "yes", NA)
)

In the table above, Group A has a threshold of 4, and only value of 5 is higher. But in Group B, threshold is 2, and both value of 3 and 5 is higher. However, only row with value of 3 is marked.

enter image description here

I was able to do this by identifying which rows had value greater than threshold, then removing the repeated value:

library(dplyr)
test %>%
  group_by(grp) %>%
  mutate(want = if_else(value > threshold, "yes", NA_character_)) %>%
  mutate(across(want, ~replace(.x, duplicated(.x), NA)))

enter image description here

I was wondering if there was a direct way to do this using a single logical statement rather than doing it two-step method, something along the line of:

test %>%
  group_by(grp) %>%
  mutate(want = if_else(???, "yes", NA_character_))

The answer doesn't have to be on R either. Just a logical step explanation would suffice as well. Perhaps using a rank?

Thank you!

Upvotes: 1

Views: 832

Answers (3)

thelatemail
thelatemail

Reputation: 93938

This is a perfect chance for a data.table answer using its non-equi matching and multiple match handling capabilities:

library(data.table)
setDT(test)
test[test, on=.(grp, value>threshold), mult="first", flag := TRUE]
test
#      grp value threshold   flag
#   <char> <num>     <num> <lgcl>
#1:      A     1         4     NA
#2:      A     3         4     NA
#3:      A     5         4   TRUE
#4:      B     1         2     NA
#5:      B     3         2   TRUE
#6:      B     5         2     NA

Find the "first" matching value in each group that is greater than > the threshold and set := it to TRUE

Upvotes: 1

TarJae
TarJae

Reputation: 79246

Here is more direct way:

The essential part: With min(which((value > threshold) == TRUE) we get the first TRUE in our column,

Next we use an ifelse and check the number we get to the row number and set the conditions:

library(dplyr)

test %>%
  group_by(grp) %>% 
  mutate(want = ifelse(row_number()==min(which((value > threshold) == TRUE)),
                       "yes", NA_character_))
 grp   value threshold want 
  <chr> <dbl>     <dbl> <chr>
1 A         1         4 NA   
2 A         3         4 NA   
3 A         5         4 yes  
4 B         1         2 NA   
5 B         3         2 yes  
6 B         5         2 NA   
> 

Upvotes: 2

r2evans
r2evans

Reputation: 160952

library(dplyr)
test %>%
  group_by(grp) %>%
  mutate(want = (value > threshold), want = want & !lag(cumany(want))) %>%
  ungroup()
# # A tibble: 6 × 4
#   grp   value threshold want 
#   <chr> <dbl>     <dbl> <lgl>
# 1 A         1         4 FALSE
# 2 A         3         4 FALSE
# 3 A         5         4 TRUE 
# 4 B         1         2 FALSE
# 5 B         3         2 TRUE 
# 6 B         5         2 FALSE

If you really want strings, you can if_else after this.

Upvotes: 3

Related Questions