user150272
user150272

Reputation: 91

Search indexes in data.table R

I have a data.table, proce, where each line defines a "special procedure". Now, I have another data.table with the patient procedures, codes. For each person, I want to extract the indexes of "special procedures" that match with his/her procedures (if they have any). Here is an example:

library(data.table)
proce <- data.table(v1 = c('o09513','o721','o701','z370'), v2 = c('0w8nxzz','10d07z6','0tqd7zz','0uqg0zz'),
                         v3 = c('3e030vj','3e033vj',NA,NA))

codes <- data.table(a1 =  c(list(c('o721','10d07z6','3e033vj')),
                            list(c('z370','0uqg0zz',"0tqd7zz","o701")),
                            list(c('o09513','o721','o701','z370','0uqg8zz'))))
> proce
       v1      v2      v3
1: o09513 0w8nxzz 3e030vj
2:   o721 10d07z6 3e033vj
3:   o701 0tqd7zz    <NA>
4:   z370 0uqg0zz    <NA>

> codes
                              a1
1:          o721,10d07z6,3e033vj
2:     z370,0uqg0zz,0tqd7zz,o701
3: o09513,o721,o701,z370,0uqg8zz

Implementation here, but since both tables have hundred thousands of lines, it's slow.


index_procedures <- list()     
for(i in 1:nrow(codes)){ # i <- 2
  a2 <- unlist(codes[i,a1])
  index_procedures[[i]] <- which(apply(proce[,.(v1,v2,v3)], 1,function(x) all(x[!is.na(x)] %in% a2)))
}
index_procedures
> index_procedures
[[1]]
[1] 2

[[2]]
[1] 3 4

[[3]]
integer(0)

Upvotes: 1

Views: 72

Answers (2)

h3rm4n
h3rm4n

Reputation: 4187

I'm not sure about performance, but the following code might be an alternative:

pl <- split(as.matrix(proce), seq_len(nrow(proce)))
pl <- lapply(pl, na.omit)

codes[, indexes := lapply(a1, function(x) which(unlist(lapply(pl, function(p) all(p %in% x)))) )]

Upvotes: 0

Uwe
Uwe

Reputation: 42544

If I understand correctly,

  • codes contains procedure steps which have been applied to a patient. One row in codes refers to one patient.
  • proce contains procedure steps which constitute a special procedure.

The OP wants to identify which special procedures have been applied on each patient (if any). Thereby, a special procedure is only considered to have been applied on a patient if all of its procedure steps have applied.

To solve this, I suggest to reshape all data in a tidy format, i.e., in long format, first.

Then we can join on procedure steps, filter for complete special procedures and aggregate to get one per patient:

lc <- codes[, cid := .I][, .(step = unlist(a1)), by = cid]
lp <- melt(proce[, pid := .I], "pid", na.rm = TRUE, value.name = "step")[
  , n_steps := .N, by = pid][]
lp[lc, on = .(step)][
  , .N == first(n_steps), by = .(cid, pid)][
    (V1), .(pid = toString(sort(pid))), by = cid]
   cid  pid
1:   1    2
2:   2 3, 4

Note that the pids are shown in a condensed form for demonstration only; other output formats are available as well depending on subsequent processing steps.

If it is required to show all patients even if they have not received a special procedure:

lp[lc, on = .(step)][, .N == first(n_steps), by = .(cid, pid)][
  V1 | is.na(V1), .(pid = toString(sort(pid))), by = cid]
   cid  pid
1:   1    2
2:   2 3, 4
3:   3

Commented code

# reshape data to long format, thereby adding a row number to identify patients
lc <- codes[, cid := .I][, .(step = unlist(a1)), by = cid]
# reshape data to long format, thereby adding a row number to identify special procdures
lp <- melt(proce[, pid := .I], "pid", na.rm = TRUE, value.name = "step")[
  # count the number of procedure steps which constitute a special procedure
  , n_steps := .N, by = pid][]
# join on procedure steps
lp[lc, on = .(step)][
  # group  by patient and special procedure and test for completeness of steps 
  , .N == first(n_steps), by = .(cid, pid)][
    # filter for complete special procedures and aggregate to get one row per patient
    (V1), .(pid = toString(sort(pid))), by = cid]

After reshaping, lc is

    cid    step
 1:   1    o721
 2:   1 10d07z6
 3:   1 3e033vj
 4:   2    z370
 5:   2 0uqg0zz
 6:   2 0tqd7zz
 7:   2    o701
 8:   3  o09513
 9:   3    o721
10:   3    o701
11:   3    z370
12:   3 0uqg8zz

and lp is

    pid variable    step n_steps
 1:   1       v1  o09513       3
 2:   2       v1    o721       3
 3:   3       v1    o701       2
 4:   4       v1    z370       2
 5:   1       v2 0w8nxzz       3
 6:   2       v2 10d07z6       3
 7:   3       v2 0tqd7zz       2
 8:   4       v2 0uqg0zz       2
 9:   1       v3 3e030vj       3
10:   2       v3 3e033vj       3

Upvotes: 1

Related Questions