Reputation: 87
UPDATED to include multiple options in new variables:
I'm working with a messy big data patient file (> 40 million rows). Each patient (id
) has several rows. Each row (roughly) represents one consultation with a symptom/disease code (icpc
). I add a new column with categories for patients with specific conditions (based on columns icpc
and icpc2
).
My raw data.frame (df
) looks something like this (this is fabricated data, id
is much longer in my dataset and I left out irrelevant columns which I like to drop):
id icpc icpc2 reg.date
1: 123 D95 F15 19JUN2015
2: 123 F85 15AUG2016
3: 332 A01 16MAR2010
4: 332 A04 20JAN2018
5: 332 K20 20FEB2017
6: 100 B10 01JUN2017
7: 100 A04 11JAN2008
8: 113 T08 18MAR2018
9: 113 P28 19JAN2017
10: 113 D95 A01 16JAN2013
11: 113 A04 01MAY2009
12: 551 B12 A01 03APR2011
13: 551 D95 09MAY2015
Say I want to categorize patients with D95
and/or A01
as yes in the new column named 'condit' (based on two columns icpc
and icpc2
). The following works:
cond1 <- c("D95", "A01")
setDT(df)[, condit := ifelse(any(icpc %in% cond1 | icpc2 %in% cond1), "yes","no"), by=id]
df
But now I want to categorize several codes from icpc
and icpc2
in the new column condit
. So for instance, D95 and/or A01 from icpc
or icpc2
as A
, A04
and/or T08
as B
, B10
as C
in condit
. NOTE: A
should overwrite B
(see row 4, 8 and 11), B
should overwrite C
etc (as it is possible id may have fall in several categories).
This is the data.frame (df
) I want:
id icpc icpc2 reg.date condit
1: 123 D95 F15 19JUN2015 A
2: 123 F85 15AUG2016 A
3: 332 A01 16MAR2010 A
4: 332 A04 20JAN2018 A
5: 332 K20 20FEB2017 A
6: 100 B10 01JUN2017 C
7: 100 A04 11JAN2008 C
8: 113 T08 18MAR2018 A
9: 113 P28 19JAN2017 A
10: 113 D95 A01 16JAN2013 A
11: 113 A04 01MAY2009 A
12: 551 B12 A01 03APR2011 A
13: 551 D90 09MAY2015 A
Any help will be greatly appreciated. Thank you!
Upvotes: 1
Views: 466
Reputation: 19384
With your large (> 40 million rows) data set, the data.table package might be a good choice:
library(data.table)
cond1 <- c("D95", "A01")
setDT(df)[, condit := ifelse(any(icpc %in% cond1 | icpc2 %in% cond1), "yes","no"), by=id]
df
id icpc icpc2 reg.date condit
1: 123 D95 F15 19JUN2015 yes
2: 123 F85 15AUG2016 yes
3: 332 A01 16MAR2010 yes
4: 332 A04 20JAN2018 yes
5: 332 K20 20FEB2017 yes
6: 100 B10 01JUN2017 no
7: 100 A04 11JAN2008 no
8: 113 T08 18MAR2018 yes
9: 113 P28 19JAN2017 yes
10: 113 D95 A01 16JAN2013 yes
11: 113 A04 01MAY2009 yes
12: 551 B12 A01 03APR2011 yes
13: 551 D95 09MAY2015 yes
Data:
df <- structure(list(id = c(123L, 123L, 332L, 332L, 332L, 100L, 100L,
113L, 113L, 113L, 113L, 551L, 551L), icpc = c("D95", "F85", "A01",
"A04", "K20", "B10", "A04", "T08", "P28", "D95", "A04", "B12",
"D95"), icpc2 = c("F15", "", "", "", "", "", "", "", "", "A01",
"", "A01", ""), reg.date = c("19JUN2015", "15AUG2016", "16MAR2010",
"20JAN2018", "20FEB2017", "01JUN2017", "11JAN2008", "18MAR2018",
"19JAN2017", "16JAN2013", "01MAY2009", "03APR2011", "09MAY2015"
)), class = "data.frame", row.names = c(NA, -13L))
Edit: for multiple conditions:
cond1 <- c("D95", "A01") # A
cond2 <- c("A04", "T08") # B
cond3 <- "B10" # C
setDT(df)[, condit := if(any(icpc %in% cond1 | icpc2 %in% cond1)) "A" else
if(any(icpc %in% cond2 | icpc2 %in% cond2)) "B" else
if(any(icpc %in% cond3 | icpc2 %in% cond3)) "C" else "", by=id]
id icpc icpc2 reg.date condit
1: 123 D95 F15 19JUN2015 A
2: 123 F85 15AUG2016 A
3: 332 A01 16MAR2010 A
4: 332 A04 20JAN2018 A
5: 332 K20 20FEB2017 A
6: 100 B10 01JUN2017 B
7: 100 A04 11JAN2008 B
8: 113 T08 18MAR2018 A
9: 113 P28 19JAN2017 A
10: 113 D95 A01 16JAN2013 A
11: 113 A04 01MAY2009 A
12: 551 B12 B10 03APR2011 C
13: 551 D96 09MAY2015 C
Data: (slightly modified from the original since no "C" condition was found.
df <- structure(list(id = c(123L, 123L, 332L, 332L, 332L, 100L, 100L,
113L, 113L, 113L, 113L, 551L, 551L), icpc = c("D95", "F85", "A01",
"A04", "K20", "B10", "A04", "T08", "P28", "D95", "A04", "B12",
"D96"), icpc2 = c("F15", "", "", "", "", "", "", "", "", "A01",
"", "B10", ""), reg.date = c("19JUN2015", "15AUG2016", "16MAR2010",
"20JAN2018", "20FEB2017", "01JUN2017", "11JAN2008", "18MAR2018",
"19JAN2017", "16JAN2013", "01MAY2009", "03APR2011", "09MAY2015"
)), class = "data.frame", row.names = c(NA, -13L))
Tested on a data frame with 40M rows: system.time(...)
# user system elapsed
# 111.11 1.17 111.97
Using dplyr:
# Error: cannot allocate vector of size 274.7 Mb
# Timing stopped at: 4.19 1.11 5.39
Upvotes: 1
Reputation: 270130
Check if any element of c(icpc, icpc2)
is in the desired codes. Note that result is a simple character string yes
or no
so we can just use if
rather than if_else
.
DF %>%
group_by(id) %>%
mutate(condit = if (any(c(icpc, icpc2) %in% c("D95", "A01"))) "yes" else "no") %>%
ungroup
A second approach is to convert the data to long form in which case there will only be one icpc column (called value
), set condit
and then convert back to wide form (or you might want to just leave it in long form). The select
at the end reorders the columns to be in the same order as the input, i.e. id column first, icpc column second, etc.
library(tidyr)
DF %>%
pivot_longer(starts_with("icpc")) %>%
filter(name != "") %>%
group_by(id) %>%
mutate(condit = if (any(value %in% c("D95", "A01"))) "yes" else "no") %>%
pivot_wider %>%
select(names(DF))
The input in reproducible form is assumed to be:
Lines <- "id icpc icpc2 reg.date
123 D95 F15 19JUN2015
123 F85 15AUG2016
332 A01 16MAR2010
332 A04 20JAN2018
332 K20 20FEB2017
100 B10 01JUN2017
100 A04 11JAN2008
113 T08 18MAR2018
113 P28 19JAN2017
113 D95 A01 16JAN2013
113 A04 01MAY2009
551 B12 A01 03APR2011
551 D95 09MAY2015"
L <- readLines(textConnection(Lines))
L <- sub(" (\\S+) ", ",\\1,", L)
L <- sub(" +", ",", L)
DF <- read.csv(text = L, check.names = FALSE, as.is = TRUE, strip.white = TRUE)
Upvotes: 1
Reputation: 389235
You need to use any
as you have already discovered since you want to assign values to entire groups.
library(dplyr)
df %>%
group_by(id) %>%
mutate(condit = if(any(icpc %in% c("D95", "A01") | icpc2 %in% c("D95", "A01")))
"yes" else "no")
Or a faster option without if
/else
would be
df %>%
group_by(id) %>%
mutate(condit = c("no", "yes")[(any(icpc %in% c("D95", "A01") |
icpc2 %in% c("D95", "A01"))) + 1])
Upvotes: 1
Reputation: 6226
I think you should write:
df.cat <- df %>%
group_by(id) %>%
mutate(condit =
if_else((icpc %in% c("D95", "A01")) | (icpc2 %in% c("D95", "A01")) ~ 'yes', 'no'))
You should not use ==
when more than one choice is possible. By the way, I added parentheses to your logical comparisons
Upvotes: 2