Reputation: 69

Find the column index of an element for each row of a data frame

I have a large data frame (~4.5m rows), each row corresponds to a separate admission to hospital.

Within each admission are up to 20 diagnosis codes in columns #7 to #26. In addition, I have a field assigned as the "main diagnosis". It was my assumption that the "main diagnosis" corresponded to the first of the 20 diagnosis codes. That is incorrect - sometimes it's the 1st, others the 2nd, 3rd, etc. I'm interested in that distribution.

ID        MainDiagCode  Diag_1  Diag_2  Diag_3 ...
Patient1  J123          J123    R343    S753
Patient2  G456          F119    E159    G456
Patient3  T789          L292    T789    W474

I'd like to add a column to my data frame that tells me which of the 20 diagnosis codes matches to the "main" one.

ID        MainDiagCode  Diag_1  Diag_2  Diag_3 ...  NewColumn
Patient1  J123          J123    R343    S753        1
Patient2  G456          F119    E159    G456        3
Patient3  T789          L292    T789    W474        2

I've been able to get a loop running:

   df$NewColumn[i] <-
  unname(which(apply(df[i, 7:26], 2, function(x)
    any(
      grepl(df$MainDiagCode[i], x)
    ))))

I'm wondering if there's a better way to do this without using a loop, as that's very slow indeed.

Thank you in advance.

Upvotes: 2

Answers (4)

alexis_laz

Reputation: 13122

With 20 diagnoses and 4.5m patients it might be more efficient to use a simple loop over columns and search for matches:

ff = function(main, diags)
{
    ans = rep_len(NA_integer_, length(main))
    for(i in seq_along(diags)) ans[main == diags[[i]]] = i      
    return(ans)
}
ff(as.character(dat$MainDiagCode), lapply(dat[-(1:2)], as.character))
#[1] 1 3 2

If more than one diagnosis matches the main you might need adjustments to return the first and not the last (as above) diagnosis. Perhaps, it might be even more efficient to reduce the number of rows checked in each iteration depending on when a match is found.

dat = structure(list(PatientID = structure(1:3, .Label = c("Patient1", 
"Patient2", "Patient3"), class = "factor"), MainDiagCode = structure(c(2L, 
1L, 3L), .Label = c("G456", "J123", "T789"), class = "factor"), 
    Diag_1 = structure(c(2L, 1L, 3L), .Label = c("F119", "J123", 
    "L292"), class = "factor"), Diag_2 = structure(c(2L, 1L, 
    3L), .Label = c("E159", "R343", "T789"), class = "factor"), 
    Diag_3 = structure(c(2L, 1L, 3L), .Label = c("G456", "S753", 
    "W474"), class = "factor")), .Names = c("PatientID", "MainDiagCode", 
"Diag_1", "Diag_2", "Diag_3"), row.names = c(NA, -3L), class = "data.frame")

Upvotes: 3

eipi10

Reputation: 93871

df$NewColumn = apply(df, 1, function(x) match(x["MainDiagCode"], x[-c(1,2)]))

df

        ID MainDiagCode Diag_1 Diag_2 Diag_3 NewColumn
1 Patient1         J123   J123   R343   S753         1
2 Patient2         G456   F119   E159   G456         3
3 Patient3         T789   L292   T789   W474         2

It's safer to return the actual column name rather than relying on the match position to be equal to the diagnosis number. For example:

# Get the names of the diagnosis columns
diag.cols = names(df)[grep("^Diag", names(df))]

Extract the column name of the matched column:

apply(df, 1, function(x) {
      names(df[,diag.cols])[match(x["MainDiagCode"], x[diag.cols])]
})
[1] "Diag_1" "Diag_3" "Diag_2"

Extract the number at the end of the matched column name:

library(stringr)

apply(df, 1, function(x) {
  as.numeric(
    str_extract(
      names(df[,diag.cols])[match(x["MainDiagCode"], x[diag.cols])], "[0-9]{1,2}$")
    )
  })

[1] 1 3 2

Upvotes: 3

cderv

Reputation: 6552

As you have a lot of rows, using data.table could improve performance

library(data.table)
DT <- data.table(PatientID = paste0("Patient", 1:3), 
                 MainDiagCode = c("J123",  "G456", "T789"),
                 Diag_1 = c("J123", "F119", "L292"),
                 Diag_2 = c("R343", "E159", "T789"),
                 Diag_3 = c("S753", "G456", "W474")
)

DT[, NewColumn := match(MainDiagCode, .SD[, -1, with = F]), by = PatientID]
DT
#>    PatientID MainDiagCode Diag_1 Diag_2 Diag_3 NewColumn
#> 1:  Patient1         J123   J123   R343   S753         1
#> 2:  Patient2         G456   F119   E159   G456         3
#> 3:  Patient3         T789   L292   T789   W474         2

Upvotes: 0

IRTFM

Reputation: 263481

This does a row-by-row comparison of the three columns to the 'MainDiagCode':

apply( dat[-1], 1, function(x) which( x[-1] == x['MainDiagCode'] )  )
[1] 1 3 2

So :

dat$NewColumn <- apply( dat[-1], 1, function(x) which( x[-1] == x['MainDiagCode'] )  )

Upvotes: 2

Find the column index of an element for each row of a data frame

Answers (4)

Related Questions