Reputation: 69
I have a large data frame (~4.5m rows), each row corresponds to a separate admission to hospital.
Within each admission are up to 20 diagnosis codes in columns #7 to #26. In addition, I have a field assigned as the "main diagnosis". It was my assumption that the "main diagnosis" corresponded to the first of the 20 diagnosis codes. That is incorrect - sometimes it's the 1st, others the 2nd, 3rd, etc. I'm interested in that distribution.
ID MainDiagCode Diag_1 Diag_2 Diag_3 ...
Patient1 J123 J123 R343 S753
Patient2 G456 F119 E159 G456
Patient3 T789 L292 T789 W474
I'd like to add a column to my data frame that tells me which of the 20 diagnosis codes matches to the "main" one.
ID MainDiagCode Diag_1 Diag_2 Diag_3 ... NewColumn
Patient1 J123 J123 R343 S753 1
Patient2 G456 F119 E159 G456 3
Patient3 T789 L292 T789 W474 2
I've been able to get a loop running:
df$NewColumn[i] <-
unname(which(apply(df[i, 7:26], 2, function(x)
any(
grepl(df$MainDiagCode[i], x)
))))
I'm wondering if there's a better way to do this without using a loop, as that's very slow indeed.
Thank you in advance.
Upvotes: 2
Views: 574
Reputation: 13122
With 20 diagnoses and 4.5m patients it might be more efficient to use a simple loop over columns and search for matches:
ff = function(main, diags)
{
ans = rep_len(NA_integer_, length(main))
for(i in seq_along(diags)) ans[main == diags[[i]]] = i
return(ans)
}
ff(as.character(dat$MainDiagCode), lapply(dat[-(1:2)], as.character))
#[1] 1 3 2
If more than one diagnosis matches the main you might need adjustments to return the first and not the last (as above) diagnosis. Perhaps, it might be even more efficient to reduce the number of rows checked in each iteration depending on when a match is found.
dat = structure(list(PatientID = structure(1:3, .Label = c("Patient1",
"Patient2", "Patient3"), class = "factor"), MainDiagCode = structure(c(2L,
1L, 3L), .Label = c("G456", "J123", "T789"), class = "factor"),
Diag_1 = structure(c(2L, 1L, 3L), .Label = c("F119", "J123",
"L292"), class = "factor"), Diag_2 = structure(c(2L, 1L,
3L), .Label = c("E159", "R343", "T789"), class = "factor"),
Diag_3 = structure(c(2L, 1L, 3L), .Label = c("G456", "S753",
"W474"), class = "factor")), .Names = c("PatientID", "MainDiagCode",
"Diag_1", "Diag_2", "Diag_3"), row.names = c(NA, -3L), class = "data.frame")
Upvotes: 3
Reputation: 93871
df$NewColumn = apply(df, 1, function(x) match(x["MainDiagCode"], x[-c(1,2)]))
df
ID MainDiagCode Diag_1 Diag_2 Diag_3 NewColumn
1 Patient1 J123 J123 R343 S753 1
2 Patient2 G456 F119 E159 G456 3
3 Patient3 T789 L292 T789 W474 2
It's safer to return the actual column name rather than relying on the match position to be equal to the diagnosis number. For example:
# Get the names of the diagnosis columns
diag.cols = names(df)[grep("^Diag", names(df))]
Extract the column name of the matched column:
apply(df, 1, function(x) {
names(df[,diag.cols])[match(x["MainDiagCode"], x[diag.cols])]
})
[1] "Diag_1" "Diag_3" "Diag_2"
Extract the number at the end of the matched column name:
library(stringr)
apply(df, 1, function(x) {
as.numeric(
str_extract(
names(df[,diag.cols])[match(x["MainDiagCode"], x[diag.cols])], "[0-9]{1,2}$")
)
})
[1] 1 3 2
Upvotes: 3
Reputation: 6552
As you have a lot of rows, using data.table
could improve performance
library(data.table)
DT <- data.table(PatientID = paste0("Patient", 1:3),
MainDiagCode = c("J123", "G456", "T789"),
Diag_1 = c("J123", "F119", "L292"),
Diag_2 = c("R343", "E159", "T789"),
Diag_3 = c("S753", "G456", "W474")
)
DT[, NewColumn := match(MainDiagCode, .SD[, -1, with = F]), by = PatientID]
DT
#> PatientID MainDiagCode Diag_1 Diag_2 Diag_3 NewColumn
#> 1: Patient1 J123 J123 R343 S753 1
#> 2: Patient2 G456 F119 E159 G456 3
#> 3: Patient3 T789 L292 T789 W474 2
Upvotes: 0
Reputation: 263481
This does a row-by-row comparison of the three columns to the 'MainDiagCode':
apply( dat[-1], 1, function(x) which( x[-1] == x['MainDiagCode'] ) )
[1] 1 3 2
So :
dat$NewColumn <- apply( dat[-1], 1, function(x) which( x[-1] == x['MainDiagCode'] ) )
Upvotes: 2