StatsNTats
StatsNTats

Reputation: 149

How can I create an incremental ID column based on whenever one of two variables are encountered?

My data came to me like this (but with 4000+ records). The following is data for 4 patients. Every time you see surgery OR age reappear, it is referring to a new patient.

col1 = c("surgery", "age", "weight","albumin","abiotics","surgery","age", "weight","BAPPS", "abiotics","surgery",  "age","weight","age","weight","BAPPS","albumin")
col2 = c("yes","54","153","normal","2","no","65","134","yes","1","yes","61","210", "46","178","no","low")
testdat = data.frame(col1,col2)

So to say again, every time surgery or age appear (surgery isn't always there, but age is), those records and the ones after pertain to the same patient until you see surgery or age appear again.

Thus I somehow need to add an ID column with this data:

ID = c(1,1,1,1,1,2,2,2,2,2,3,3,3,4,4,4,4)
testdat$ID = ID

I know how to transpose and melt and all that to put the data into regular format, but how can I create that ID column?

Advice on relevant tags to use is helpful!

Upvotes: 0

Views: 50

Answers (3)

Mike S
Mike S

Reputation: 312

Using dplyr:

library(dplyr)
testdat = testdat %>%
    mutate(patient_counter = cumsum(col1 == 'surgery' | (col1 == 'age' & lag(col1 != 'surgery'))))

This works by checking whether the col1 value is either 'surgery' or 'age', provided 'age' is not preceded by 'surgery'. It then uses cumsum() to get the cumulative sum of the resulting logical vector.

Upvotes: 1

CPak
CPak

Reputation: 13581

You can try the following

keywords <- c('surgery', 'age')
lgl <- testdat$col1 %in% keywords
testdat$ID <- cumsum(c(0, diff(lgl)) == 1) + 1

       col1   col2 ID
1   surgery    yes  1
2       age     54  1
3    weight    153  1
4   albumin normal  1
5  abiotics      2  1
6   surgery     no  2
7       age     65  2
8    weight    134  2
9     BAPPS    yes  2
10 abiotics      1  2
11  surgery    yes  3
12      age     61  3
13   weight    210  3
14      age     46  4
15   weight    178  4
16    BAPPS     no  4
17  albumin    low  4

Upvotes: 1

Grant
Grant

Reputation: 356

Assuming that surgery and age will be the first two pieces of information for each patient and that each patient will have a information that is not age or surgery afterward, this is a solution.

col1 = c("surgery", "age", "weight","albumin","abiotics","surgery","age", "weight","BAPPS", "abiotics","surgery",  "age","weight","age","weight","BAPPS","albumin")
col2 = c("yes","54","153","normal","2","no","65","134","yes","1","yes","61","210", "46","178","no","low")
testdat = data.frame(col1,col2)

# Use a tibble and get rid of factors. 
dfTest = as_tibble(testdat) %>% 
    mutate_all(as.character)

# A little dplyr magic to see find if the start of a new patient, then give them an id. 
dfTest = dfTest %>%
    mutate(couldBeStart = if_else(col1 == "surgery" | col1 == "age", T, F)) %>% 
    mutate(isStart = couldBeStart & !lag(couldBeStart, default = FALSE)) %>% 
    mutate(patientID = cumsum(isStart)) %>% 
    select(-couldBeStart, -isStart)

# # A tibble: 17 x 3
#    col1     col2   patientID
#    <chr>    <chr>      <int>
#  1 surgery  yes            1
#  2 age      54             1
#  3 weight   153            1
#  4 albumin  normal         1
#  5 abiotics 2              1
#  6 surgery  no             2
#  7 age      65             2
#  8 weight   134            2
#  9 BAPPS    yes            2
# 10 abiotics 1              2
# 11 surgery  yes            3
# 12 age      61             3
# 13 weight   210            3
# 14 age      46             4
# 15 weight   178            4
# 16 BAPPS    no             4
# 17 albumin  low            4

# Get the data to a wide workable format. 
dfTest %>% spread(col1, col2)

# # A tibble: 4 x 7
#   patientID abiotics age   albumin BAPPS surgery weight
#       <int> <chr>    <chr> <chr>   <chr> <chr>   <chr> 
# 1         1 2        54    normal  NA    yes     153   
# 2         2 1        65    NA      yes   no      134   
# 3         3 NA       61    NA      NA    yes     210   
# 4         4 NA       46    low     no    NA      178   

Upvotes: 2

Related Questions