Vicki Latham
Vicki Latham

Reputation: 67

Group and then mutate

I have a dataset where I want to select only one row for each individual each year - however, I would like to mutate a column so that if it says 'yes' for any of that persons rows then all the rows say 'yes'.

This is an example of the dataset I have:

enter image description here

So where the name, clinic and year are the same, I want the tested column to say 'yes' if any of the other rows for that grouping say 'yes'.

Therefore, this is what I would want the dataset to finally look like:

enter image description here

Upvotes: 0

Views: 64

Answers (2)

L Tyrone
L Tyrone

Reputation: 7075

I have included an extra patient who visited two different clinics to demonstrate what happens in those cases as well. This dplyr solution works by checking if any() values in each group_by() are "yes", and then modifies the tested values if true.

library(dplyr)

df <- data.frame(name = paste0("patient ", rep(c("x", "xy", "xyz", "abc"), each = 2)),
                 clinic = c(rep(c("xxy", "ggf", "ffr"), each = 2), "hhx", "hhy"),
                 year = c(rep(c(2022, 2019), each = 2), 2018, 2019, 2017, 2017),
                 date = month.name[c(4,5,1,2,3,5,6,7)],
                 tested = c("yes", "no", "no", "yes", "yes", "no", "yes", "no"))

result <- df %>%
  group_by(name, clinic, year) %>%
  mutate(tested = case_when(any(tested == "yes") ~ "yes", TRUE ~ tested)) %>%
  ungroup()


result
# A tibble: 8 × 5
  name        clinic  year date     tested
  <chr>       <chr>  <dbl> <chr>    <chr> 
1 patient x   xxy     2022 April    yes   
2 patient x   xxy     2022 May      yes   
3 patient xy  ggf     2019 January  yes   
4 patient xy  ggf     2019 February yes   
5 patient xyz ffr     2018 March    yes   
6 patient xyz ffr     2019 May      no    
7 patient abc hhx     2017 June     yes   
8 patient abc hhy     2017 July     no 

Upvotes: 1

Josep Pueyo
Josep Pueyo

Reputation: 783

This is quite straightforward using dplyr. Here is an option:

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

df <- tribble(
  ~ name, ~ clinic, ~ year, ~ date, ~ tested,
  "a",       "xxy",    2022,  "April", "yes",
  "a",       "xxy",    2022,  "May", "no",
  "b",       "ggf",    2019,  "Jan", "no",
  "b",       "ggf",    2019,  "Feb", "yes",
  "c",       "ffr",    2018,  "March", "yes",
  "c",       "ffr",    2019,  "May", "no"
)

df |> 
  mutate(tested2 = if_else(any(tested == "yes"), "yes", "no"), .by = c(name, year))
#> # A tibble: 6 × 6
#>   name  clinic  year date  tested tested2
#>   <chr> <chr>  <dbl> <chr> <chr>  <chr>  
#> 1 a     xxy     2022 April yes    yes    
#> 2 a     xxy     2022 May   no     yes    
#> 3 b     ggf     2019 Jan   no     yes    
#> 4 b     ggf     2019 Feb   yes    yes    
#> 5 c     ffr     2018 March yes    yes    
#> 6 c     ffr     2019 May   no     no

Created on 2024-02-25 with reprex v2.1.0

I would recommend to read this question before posting future questions. It makes easier to help you.

Upvotes: 1

Related Questions