John Doe
John Doe

Reputation: 121

Select data.table columns based on condition, within by

I want to extract data.table columns if their contents fulfill a criteria. And I need a method that will work with by (or in some other way within combinations of columns). I am not very experienced with data.table and have tried my best with .SDcol and what else I could think of.

Example: I often have datasets with observations at multiple time points for multiple subjects. They also contain covariates which do not vary within subjects.

dt1 <- data.table(
    id=c(1,1,2,2,3,3),
    time=c(1,2,1,2,1,2),
    meas=c(452,23,555,33,322,32),
    age=c(30,30,54,54,20,20),
    bw=c(75,75,81,81,69,70)
)

How do I (efficiently) select the columns that do not vary within id (in this case, id and age)? I'd like a function call that would return

    id age
1:  1  30
2:  2  54
3:  3  20

And how do I select the columns that do vary within ID (so drop age)? The function call should return:

   id time meas bw
1:  1    1  452 75
2:  1    2   23 75
3:  2    1  555 81
4:  2    2   33 81
5:  3    1  322 69
6:  3    2   32 70

Of course, I am interested if you know of a function that addresses the specific example above, but I am even more curious on how to do this generally. Columns that contain more than two values > 1000 within any combinations of id and time in by=.(id,time), or whatever...

Thanks!

Upvotes: 0

Views: 1056

Answers (3)

John Doe
John Doe

Reputation: 121

Based on chinsoon12's suggestion, I managed to put something together. I need four steps, and I'm not sure how efficient it is, but at least it does the job. To recap, this is the dataset:

dt1
   id time meas age bw
1:  1    1  452  30 75
2:  1    2   23  30 75
3:  2    1  555  54 81
4:  2    2   33  54 81
5:  3    1  322  20 69
6:  3    2   32  20 70

I put this together to get the columns that are constant within "id" (only age):

cols.id <- "id"
dt2 <- dt1[, .SD[, lapply(.SD, function(x)uniqueN(x)==1)], by=cols.id]
ifkeep <- dt2[,sapply(.SD,all),.SDcols=!(cols.id)]
keep <- c(cols.id,setdiff(colnames(dt2),cols.id)[ifkeep])
unique(dt1[,keep,with=F])
   id age
1:  1  30
2:  2  54
3:  3  20

And to get the columns that vary within any value of "id" (age is dropped):

cols.id <- "id"
## differenct from above: ==1 -> >1
dt2 <- dt1[, .SD[, lapply(.SD, function(x)uniqueN(x)>1)], by=cols.id]
## difference from above: all -> any
ifkeep <- dt2[,sapply(.SD,any),.SDcols=!(cols.id)]
keep <- c(cols.id,setdiff(colnames(dt2),cols.id)[ifkeep])
unique(dt1[,keep,with=F])

   id time meas bw
1:  1    1  452 75
2:  1    2   23 75
3:  2    1  555 81
4:  2    2   33 81
5:  3    1  322 69
6:  3    2   32 70

Upvotes: 0

Wimpel
Wimpel

Reputation: 27732

This might also be an option:

Count unique values per column, by ID (using data.table::uniqueN )
Check in which columns the sum of unique values (by group) equals the number of unique IDs (using colSums)
Only keep (or drop) the wanted columns

library(data.table)
ids <- uniqueN(dt1$id)
#no variation
dt1[, c( TRUE, colSums( dt1[, lapply( .SD, uniqueN ), by = id ][,-1]) == ids ), with = FALSE]

   id age
1:  1  30
2:  1  30
3:  2  54
4:  2  54
5:  3  20
6:  3  20

#variation
dt1[, c( TRUE, !colSums( dt1[, lapply( .SD, uniqueN ), by = id ][,-1]) == ids ), with = FALSE]

   id time meas bw
1:  1    1  452 75
2:  1    2   23 75
3:  2    1  555 81
4:  2    2   33 81
5:  3    1  322 69
6:  3    2   32 70

Upvotes: 0

chinsoon12
chinsoon12

Reputation: 25225

How do I (efficiently) select the columns that do not vary within id (in this case, id and age)?

Maybe something like:

f <- function(DT, byChar) {
    cols <- Reduce(intersect, DT[, .(.(names(.SD)[sapply(.SD, uniqueN)==1])), byChar]$V1)
    unique(DT[, c(byChar, cols), with=FALSE])
}
f(dt1, "id")

output:

   id age
1:  1  30
2:  2  54
3:  3  20

And how do I select the columns that do vary within ID (so drop age)?

Similarly,

f2 <- function(DT, byChar, k) {
    cols <- Reduce(intersect, DT[, .(.(names(.SD)[sapply(.SD, uniqueN)>k])), byChar]$V1)
    unique(DT[, c(byChar, cols), with=FALSE])
}
f2(dt1, "id", 1)

output:

   id time meas
1:  1    1  452
2:  1    2   23
3:  2    1  555
4:  2    2   33
5:  3    1  322
6:  3    2   32

data:

library(data.table)
dt1 <- data.table(
    id=c(1,1,2,2,3,3),
    time=c(1,2,1,2,1,2),
    meas=c(452,23,555,33,322,32),
    age=c(30,30,54,54,20,20),
    bw=c(75,75,81,81,69,70)
)

Upvotes: 1

Related Questions