jayinbluecity

Reputation: 81

How to count the variables depending on the appearance of another value in the same column?

I want to count the number of deaths before the first breast cancer, after the first breast cancer, after the second breast cancer, and so on.

My data looks like this. In the EVENT column you can see that some people have a death event before BC1 (first breast cancer), after BC1, or after BC2. I was wondering how to count the number of people in each of these sequences.

     PERSON_ID     EVENT 
3   10000000002     BC1
4   10000000002     R_B
5   10000000002     BC2
6   10000000002   DEATH
7   10000000002     EPI
20  10000000007     BC1
21  10000000007     BC2
22  10000000007   DEATH
23  10000000010   DEATH
37  10000000827   DEATH
38  10000000830     BC1
39  10000000830   DEATH 

I would like a table like this, but I'm not worried about building the table right now. I just want the corresponding counts for:

Death before BC1: #
Death after  BC1: #
Death after  BC2: #

Sorry for the bad formatting in the post; any help will be appreciated!

Upvotes: 0

Views: 89

Answers (3)

Marc Flury

Reputation: 341

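I think we can assume that nobody gets breast cancer after they die, so you can check whether a person has BC2 among their EVENT values: if they do, they died after their second breast cancer. Otherwise check for BC1 in the same way; anyone with neither is labelled BC0 below, meaning death before the first breast cancer. Note that this also assumes every person in the data has a DEATH event.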

    library("dplyr")
    df <- data.frame(PERSON_ID = c(10000000002, 10000000002, 10000000002,
                                   10000000002,
                     10000000002, 10000000007, 10000000007, 10000000007, 
                     10000000010, 10000000827, 10000000830, 10000000830),
                     EVENT = c("BC1", "R_B", "BC2", "DEATH",
                               "EPI", "BC1", "BC2", "DEATH",
                               "DEATH", "DEATH", "BC1", "DEATH" ))


    group_by(df, PERSON_ID) %>%
              summarise(Type = ifelse("BC2" %in% EVENT, "BC2",
                   ifelse("BC1" %in% EVENT, "BC1",
                          "BC0"))) %>%
      ungroup() %>%
      group_by(Type) %>%
      summarise(Count = n())
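
If the assumption above does not hold (for example, some people have no DEATH row, or events can be recorded after death), the row order can be used instead. The following is only a sketch: it assumes the rows are already in chronological order within each person, and `classify_death` and `events` are names made up for this illustration.

```r
library(dplyr)

# Sample data from the question; rows are ASSUMED to be in chronological
# order within each PERSON_ID.
events <- data.frame(
  PERSON_ID = c(10000000002, 10000000002, 10000000002, 10000000002,
                10000000002, 10000000007, 10000000007, 10000000007,
                10000000010, 10000000827, 10000000830, 10000000830),
  EVENT = c("BC1", "R_B", "BC2", "DEATH", "EPI", "BC1", "BC2", "DEATH",
            "DEATH", "DEATH", "BC1", "DEATH"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: classify one person's ordered vector of events,
# looking only at what happened before the first DEATH row.
classify_death <- function(ev) {
  death_pos <- match("DEATH", ev)              # NA if the person has no DEATH
  if (is.na(death_pos)) {
    return(NA_character_)
  }
  before <- ev[seq_len(death_pos - 1)]         # events recorded before death
  if ("BC2" %in% before) {
    "Death after BC2"
  } else if ("BC1" %in% before) {
    "Death after BC1"
  } else {
    "Death before BC1"
  }
}

death_counts <- events %>%
  group_by(PERSON_ID) %>%
  summarise(Type = classify_death(EVENT)) %>%
  filter(!is.na(Type)) %>%                     # drop people without a DEATH
  count(Type)

death_counts
```

On the sample data this gives the same counts as the simpler version, because every person there has a DEATH row and no cancer event follows it.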

Cheers

Upvotes: 1

Hugo

Reputation: 507

Here is a very simple solution in base R. The counts are stored in the results variable.

my_data<- data.frame(PERSON_ID = as.character(c(10000000002,10000000002,10000000002,10000000002,10000000002,10000000007,10000000007,10000000007,10000000010,10000000827,10000000830,10000000830)),
    EVENT  = c("BC1","R_B","BC2","DEATH","EPI","BC1","BC2","DEATH","DEATH","DEATH","BC1","DEATH"))

# Classify one person from the events they have. This assumes every person
# has a DEATH row and that no cancer event is recorded after death.
my_function <- function(ID){
    person <- subset(my_data, PERSON_ID == ID)
    has_bc1 <- any(person$EVENT == "BC1")
    has_bc2 <- any(person$EVENT == "BC2")
    if (!has_bc1) {
        return("Death_before_BC1")
    } else if (!has_bc2) {
        return("Death_after_BC1")
    } else {
        return("Death_after_BC2")
    }
}

results_tmp <- sapply(as.character(unique(my_data$PERSON_ID)), my_function)

results <- data.frame(Death_before_BC1 = sum(results_tmp == "Death_before_BC1"), 
    Death_after_BC1 = sum(results_tmp == "Death_after_BC1"),
    Death_after_BC2 = sum(results_tmp == "Death_after_BC2"))
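
As a side note, the three sum() calls could probably be collapsed into a single table() call. The sketch below restates the five labels that my_function returns for the sample data so that it runs on its own; `labels_example`, `categories` and `death_table` are names made up for this illustration.

```r
# Labels as my_function would return them for the five people in the
# sample data (restated here so the snippet is self-contained).
labels_example <- c("Death_after_BC2", "Death_after_BC2", "Death_before_BC1",
                    "Death_before_BC1", "Death_after_BC1")

# Fixing the factor levels keeps a category in the output even when its
# count is 0, and controls the order of the columns.
categories <- c("Death_before_BC1", "Death_after_BC1", "Death_after_BC2")
death_table <- table(factor(labels_example, levels = categories))

death_table
```

Converting to a factor with explicit levels is the design choice that matters here: a plain table() on the character vector would silently drop any category that never occurs.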

Upvotes: 0

cderv

Reputation: 6552

You could transform your data.frame to make this easier. One way to do it with dplyr, after reshaping with tidyr, is shown below. There are surely other solutions.

library(dplyr)
df <- readr::read_delim("PERSON_ID EVENT
10000000002 BC1
10000000002 R_B
10000000002 BC2
10000000002 DEATH
10000000002 EPI
10000000007 BC1
10000000007 BC2
10000000007 DEATH
10000000010 DEATH
10000000827 DEATH
10000000830 BC1
10000000830 DEATH", delim = " ")

# transform your data to create a new categorical column with what you want

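new_df <- df %>% 
  # flag every (person, event) pair that occurs in the data
  mutate(value = TRUE) %>% 
  # one row per person, one logical column per event type
  tidyr::spread(EVENT, value, fill = FALSE) %>%
  group_by(PERSON_ID) %>%
  mutate(cat = if_else(BC1 && BC2, "after BC2", if_else(BC1, "after BC1", "before BC1"))) %>%
  ungroup() 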

new_df
#> # A tibble: 5 × 7
#>   PERSON_ID   BC1   BC2 DEATH   EPI   R_B        cat
#>       <dbl> <lgl> <lgl> <lgl> <lgl> <lgl>      <chr>
#> 1     1e+10  TRUE  TRUE  TRUE  TRUE  TRUE  after BC2
#> 2     1e+10  TRUE  TRUE  TRUE FALSE FALSE  after BC2
#> 3     1e+10 FALSE FALSE  TRUE FALSE FALSE before BC1
#> 4     1e+10 FALSE FALSE  TRUE FALSE FALSE before BC1
#> 5     1e+10  TRUE FALSE  TRUE FALSE FALSE  after BC1

# count the variable

new_df %>% count(cat)
#> # A tibble: 3 × 2
#>          cat     n
#>        <chr> <int>
#> 1  after BC1     1
#> 2  after BC2     2
#> 3 before BC1     2
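
For readers on a newer tidyr, the same reshaping can probably be written with pivot_wider(), which has superseded spread(). This is a sketch rather than a drop-in replacement: it assumes a tidyr version that provides pivot_wider(), stores PERSON_ID as character so the IDs do not print as 1e+10, and uses `events` and `wide` as made-up names.

```r
library(dplyr)
library(tidyr)

# Sample data from the question, with PERSON_ID kept as character.
events <- tibble(
  PERSON_ID = c("10000000002", "10000000002", "10000000002", "10000000002",
                "10000000002", "10000000007", "10000000007", "10000000007",
                "10000000010", "10000000827", "10000000830", "10000000830"),
  EVENT = c("BC1", "R_B", "BC2", "DEATH", "EPI", "BC1", "BC2", "DEATH",
            "DEATH", "DEATH", "BC1", "DEATH")
)

wide <- events %>%
  mutate(value = TRUE) %>%
  # one row per person, one logical column per event type
  pivot_wider(names_from = EVENT, values_from = value,
              values_fill = list(value = FALSE)) %>%
  # after widening there is one row per person, so the vectorised `&`
  # works without grouping
  mutate(cat = case_when(
    BC1 & BC2 ~ "after BC2",
    BC1       ~ "after BC1",
    TRUE      ~ "before BC1"
  ))

cat_counts <- wide %>% count(cat)
cat_counts
```

Like the spread() version, this only looks at which events a person has, not at their order, so it relies on no cancer event being recorded after death.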

Upvotes: 0
