Reputation: 87

Create new column based on multiple conditions in multiple columns

UPDATED to include multiple options in new variables:

I'm working with a messy big data patient file (> 40 million rows). Each patient (id) has several rows. Each row (roughly) represents one consultation with a symptom/disease code (icpc). I add a new column with categories for patients with specific conditions (based on columns icpc and icpc2).

My raw data.frame (df) looks something like this (this is fabricated data, id is much longer in my dataset and I left out irrelevant columns which I like to drop):

    id icpc icpc2 reg.date 
1:  123 D95 F15   19JUN2015 
2:  123 F85       15AUG2016 
3:  332 A01       16MAR2010 
4:  332 A04       20JAN2018
5:  332 K20       20FEB2017
6:  100 B10       01JUN2017 
7:  100 A04       11JAN2008
8:  113 T08       18MAR2018
9:  113 P28       19JAN2017 
10: 113 D95 A01   16JAN2013
11: 113 A04       01MAY2009
12: 551 B12 A01   03APR2011
13: 551 D95       09MAY2015

Say I want to categorize patients with D95 and/or A01 as yes in the new column named 'condit' (based on two columns icpc and icpc2). The following works:

cond1 <- c("D95", "A01")
setDT(df)[, condit := ifelse(any(icpc %in% cond1 | icpc2 %in% cond1), "yes","no"), by=id]
df

But now I want to categorize several codes from icpc and icpc2 in the new column condit. So for instance, D95 and/or A01 from icpc or icpc2 as A, A04 and/or T08 as B, B10 as C in condit. NOTE: A should overwrite B (see row 4, 8 and 11), B should overwrite C etc (as it is possible id may have fall in several categories).

This is the data.frame (df) I want:

    id icpc icpc2 reg.date  condit
1:  123 D95 F15   19JUN2015 A
2:  123 F85       15AUG2016 A
3:  332 A01       16MAR2010 A
4:  332 A04       20JAN2018 A
5:  332 K20       20FEB2017 A
6:  100 B10       01JUN2017 C
7:  100 A04       11JAN2008 C
8:  113 T08       18MAR2018 A
9:  113 P28       19JAN2017 A
10: 113 D95 A01   16JAN2013 A
11: 113 A04       01MAY2009 A
12: 551 B12 A01   03APR2011 A
13: 551 D90       09MAY2015 A

Any help will be greatly appreciated. Thank you!

Upvotes: 1

Answers (4)

Edward

Reputation: 19384

With your large (> 40 million rows) data set, the data.table package might be a good choice:

library(data.table)

cond1 <- c("D95", "A01")
setDT(df)[, condit := ifelse(any(icpc %in% cond1 | icpc2 %in% cond1), "yes","no"), by=id]
df

     id icpc icpc2  reg.date condit
 1: 123  D95   F15 19JUN2015    yes
 2: 123  F85       15AUG2016    yes
 3: 332  A01       16MAR2010    yes
 4: 332  A04       20JAN2018    yes
 5: 332  K20       20FEB2017    yes
 6: 100  B10       01JUN2017     no
 7: 100  A04       11JAN2008     no
 8: 113  T08       18MAR2018    yes
 9: 113  P28       19JAN2017    yes
10: 113  D95   A01 16JAN2013    yes
11: 113  A04       01MAY2009    yes
12: 551  B12   A01 03APR2011    yes
13: 551  D95       09MAY2015    yes

Data:

df <- structure(list(id = c(123L, 123L, 332L, 332L, 332L, 100L, 100L, 
113L, 113L, 113L, 113L, 551L, 551L), icpc = c("D95", "F85", "A01", 
"A04", "K20", "B10", "A04", "T08", "P28", "D95", "A04", "B12", 
"D95"), icpc2 = c("F15", "", "", "", "", "", "", "", "", "A01", 
"", "A01", ""), reg.date = c("19JUN2015", "15AUG2016", "16MAR2010", 
"20JAN2018", "20FEB2017", "01JUN2017", "11JAN2008", "18MAR2018", 
"19JAN2017", "16JAN2013", "01MAY2009", "03APR2011", "09MAY2015"
)), class = "data.frame", row.names = c(NA, -13L))

Edit: for multiple conditions:

cond1 <- c("D95", "A01") # A
cond2 <- c("A04", "T08") # B
cond3 <- "B10"           # C

setDT(df)[, condit := if(any(icpc %in% cond1 | icpc2 %in% cond1)) "A" else 
                         if(any(icpc %in% cond2 | icpc2 %in% cond2)) "B" else
                            if(any(icpc %in% cond3 | icpc2 %in% cond3)) "C" else "", by=id]

     id icpc icpc2  reg.date condit
 1: 123  D95   F15 19JUN2015      A
 2: 123  F85       15AUG2016      A
 3: 332  A01       16MAR2010      A
 4: 332  A04       20JAN2018      A
 5: 332  K20       20FEB2017      A
 6: 100  B10       01JUN2017      B
 7: 100  A04       11JAN2008      B
 8: 113  T08       18MAR2018      A
 9: 113  P28       19JAN2017      A
10: 113  D95   A01 16JAN2013      A
11: 113  A04       01MAY2009      A
12: 551  B12   B10 03APR2011      C
13: 551  D96       09MAY2015      C

Data: (slightly modified from the original since no "C" condition was found.

df <- structure(list(id = c(123L, 123L, 332L, 332L, 332L, 100L, 100L, 
113L, 113L, 113L, 113L, 551L, 551L), icpc = c("D95", "F85", "A01", 
"A04", "K20", "B10", "A04", "T08", "P28", "D95", "A04", "B12", 
"D96"), icpc2 = c("F15", "", "", "", "", "", "", "", "", "A01", 
"", "B10", ""), reg.date = c("19JUN2015", "15AUG2016", "16MAR2010", 
"20JAN2018", "20FEB2017", "01JUN2017", "11JAN2008", "18MAR2018", 
"19JAN2017", "16JAN2013", "01MAY2009", "03APR2011", "09MAY2015"
)), class = "data.frame", row.names = c(NA, -13L))

Tested on a data frame with 40M rows: system.time(...)

#    user  system elapsed 
#  111.11    1.17  111.97

Using dplyr:

# Error: cannot allocate vector of size 274.7 Mb
# Timing stopped at: 4.19 1.11 5.39

Upvotes: 1

G. Grothendieck

Reputation: 270130

Check if any element of c(icpc, icpc2) is in the desired codes. Note that result is a simple character string yes or no so we can just use if rather than if_else.

DF %>%
  group_by(id) %>%
  mutate(condit = if (any(c(icpc, icpc2) %in% c("D95", "A01"))) "yes" else "no") %>%
  ungroup

A second approach is to convert the data to long form in which case there will only be one icpc column (called value), set condit and then convert back to wide form (or you might want to just leave it in long form). The select at the end reorders the columns to be in the same order as the input, i.e. id column first, icpc column second, etc.

library(tidyr)
DF %>%
  pivot_longer(starts_with("icpc")) %>%
  filter(name != "") %>%
  group_by(id) %>%
  mutate(condit = if (any(value %in% c("D95", "A01"))) "yes" else "no") %>%
  pivot_wider %>%
  select(names(DF))

Note

The input in reproducible form is assumed to be:

Lines <- "id icpc icpc2 reg.date 
123 D95 F15   19JUN2015 
123 F85       15AUG2016 
332 A01       16MAR2010 
332 A04       20JAN2018
332 K20       20FEB2017
100 B10       01JUN2017 
100 A04       11JAN2008
113 T08       18MAR2018
113 P28       19JAN2017 
113 D95 A01   16JAN2013
113 A04       01MAY2009
551 B12 A01   03APR2011
551 D95       09MAY2015"
L <- readLines(textConnection(Lines))
L <- sub(" (\\S+) ", ",\\1,", L)
L <- sub(" +", ",", L)
DF <- read.csv(text = L, check.names = FALSE, as.is = TRUE, strip.white = TRUE)

Upvotes: 1

Ronak Shah

Reputation: 389235

You need to use any as you have already discovered since you want to assign values to entire groups.

library(dplyr)
df %>%
  group_by(id) %>%
  mutate(condit = if(any(icpc %in% c("D95", "A01") | icpc2 %in% c("D95", "A01"))) 
                     "yes" else "no")

Or a faster option without if/else would be

df %>%
   group_by(id) %>%
   mutate(condit = c("no", "yes")[(any(icpc %in% c("D95", "A01") | 
                                        icpc2 %in% c("D95", "A01"))) + 1])

Upvotes: 1

linog

Reputation: 6226

I think you should write:

df.cat <- df %>%
  group_by(id) %>%
  mutate(condit = 
           if_else((icpc %in% c("D95", "A01")) | (icpc2 %in% c("D95", "A01")) ~ 'yes', 'no'))

You should not use == when more than one choice is possible. By the way, I added parentheses to your logical comparisons

Upvotes: 2

Create new column based on multiple conditions in multiple columns

Answers (4)

Note

Related Questions