Reputation: 97

Summarise column based on contents of other column

I am trying to write a code that compares two columns in the same dataframe, using the summarise to create a new column that will state if the ID was registered before the Review took place.

This is my dataframe:

tt <- structure(list(ID = c("P40", "P40", "P40", "P42", "P42", "P43", "P43",
                      "P44", "P44"),Type = c("Pre-Initial", "Review", "Review", "Initial", "Review", "Initial", "Review", "Pre-Initial", "Review"),
               Registered = c("Yes", "", "", "No", "", "Yes", "", "No", "")),
          class = "data.frame", row.names = c(NA, -9L))

Result I want to achieve:

ID  Outcome
P40 Yes
P42 No
P43 Yes
P44 No

This is the code I have tried but only shows No for all IDs

tt %>% group_by(ID) %>%
    summarise(outcome = c("No", "Yes")[all(Registered == "Yes" & Type == "Review") + 1])

Upvotes: 0

Answers (3)

arg0naut91

Reputation: 14774

Could try:

tt %>%
  group_by(ID) %>%
  summarise(
    Outcome = c("No", "Yes")[any(Type == "Review" & cumsum(Registered == "Yes") == 1) + 1]
  )

Output:

# A tibble: 4 x 2
  ID    Outcome
  <chr> <chr>  
1 P40   Yes    
2 P42   No     
3 P43   Yes    
4 P44   No

Note that this assumes the Yes for Registered occurs exactly once per each ID. Otherwise just replace cumsum(Registered == "Yes") == 1 with cumsum(Registered == "Yes") >= 1.

Upvotes: 2

Ronak Shah

Reputation: 389235

Another dplyr variant, here it returns "No" if there is no value in Registered as "Yes" or it compares the index of occurrence of it and "Review" and assign values accordingly.

library(dplyr)

tt %>%
  group_by(ID) %>%
  summarise(Outcome = if (any(Registered == "Yes")) 
      c("No", "Yes")[(which.max(Registered == "Yes") < 
                     which.max(Type == "Review"))+1] else "No")

#  ID    Outcome
#  <chr> <chr>  
#1 P40   Yes    
#2 P42   No     
#3 P43   Yes    
#4 P44   No

Upvotes: 2

Konrad Rudolph

Reputation: 546053

I’m not exactly sure what your intended result is but from your description it sounds as if the Type == 'Review' rows are simply irrelevant: All you need is to remove them, and then remove that column (and rename the Registered column):

tt %>%
    filter(Type != 'Review') %>%
    select(- Type, Outcome = Registered)

Upvotes: 0

Summarise column based on contents of other column

Answers (3)

Related Questions