Reputation: 121
I want to extract data.table columns if their contents fulfill a criterion, and I need a method that works with by (or in some other way within combinations of columns). I am not very experienced with data.table and have tried my best with .SDcols and whatever else I could think of.
Example: I often have datasets with observations at multiple time points for multiple subjects. They also contain covariates which do not vary within subjects.
library(data.table)

dt1 <- data.table(
  id   = c(1, 1, 2, 2, 3, 3),
  time = c(1, 2, 1, 2, 1, 2),
  meas = c(452, 23, 555, 33, 322, 32),
  age  = c(30, 30, 54, 54, 20, 20),
  bw   = c(75, 75, 81, 81, 69, 70)
)
How do I (efficiently) select the columns that do not vary within id (in this case, id and age)? I'd like a function call that would return
id age
1: 1 30
2: 2 54
3: 3 20
And how do I select the columns that do vary within ID (so drop age)? The function call should return:
id time meas bw
1: 1 1 452 75
2: 1 2 23 75
3: 2 1 555 81
4: 2 2 33 81
5: 3 1 322 69
6: 3 2 32 70
Of course, I am interested if you know of a function that addresses the specific example above, but I am even more curious about how to do this generally: columns that contain more than two values > 1000 within any combination of id and time (by=.(id,time)), or whatever...
Thanks!
Upvotes: 0
Views: 1056
Reputation: 121
Based on chinsoon12's suggestion, I managed to put something together. It takes four steps, and I'm not sure how efficient it is, but at least it does the job. To recap, this is the dataset:
dt1
id time meas age bw
1: 1 1 452 30 75
2: 1 2 23 30 75
3: 2 1 555 54 81
4: 2 2 33 54 81
5: 3 1 322 20 69
6: 3 2 32 20 70
I put this together to get the columns that are constant within "id" (only age):
cols.id <- "id"
## within each id, flag whether each column holds a single unique value
dt2 <- dt1[, .SD[, lapply(.SD, function(x) uniqueN(x) == 1)], by = cols.id]
## keep only the columns that are constant within every id
ifkeep <- dt2[, sapply(.SD, all), .SDcols = !(cols.id)]
keep <- c(cols.id, setdiff(colnames(dt2), cols.id)[ifkeep])
unique(dt1[, keep, with = FALSE])
id age
1: 1 30
2: 2 54
3: 3 20
And to get the columns that vary within any value of "id" (age is dropped):
cols.id <- "id"
## different from above: ==1 -> >1
dt2 <- dt1[, .SD[, lapply(.SD, function(x) uniqueN(x) > 1)], by = cols.id]
## different from above: all -> any
ifkeep <- dt2[, sapply(.SD, any), .SDcols = !(cols.id)]
keep <- c(cols.id, setdiff(colnames(dt2), cols.id)[ifkeep])
unique(dt1[, keep, with = FALSE])
id time meas bw
1: 1 1 452 75
2: 1 2 23 75
3: 2 1 555 81
4: 2 2 33 81
5: 3 1 322 69
6: 3 2 32 70
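For the more general case I was wondering about, the same four steps could be wrapped into a helper that takes an arbitrary per-column test. This is only a sketch; select_cols_by and its arguments are names I made up for illustration:
library(data.table)

## Sketch: apply `test` to every non-by column within each group and keep a
## column when `combine` (all/any) of the per-group results is TRUE.
select_cols_by <- function(DT, by.cols, test, combine = all) {
  flags  <- DT[, lapply(.SD, test), by = by.cols]
  ifkeep <- flags[, sapply(.SD, combine), .SDcols = !(by.cols)]
  keep   <- c(by.cols, setdiff(colnames(flags), by.cols)[ifkeep])
  unique(DT[, keep, with = FALSE])
}

## constant within id (same as the first result above)
select_cols_by(dt1, "id", function(x) uniqueN(x) == 1, all)
## varying within at least one id (same as the second result above)
select_cols_by(dt1, "id", function(x) uniqueN(x) > 1, any)
## e.g. columns with more than two values > 1000 in some id/time combination
## (no column qualifies in this toy data, so only id and time come back)
select_cols_by(dt1, c("id", "time"), function(x) sum(x > 1000) > 2, any)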
Upvotes: 0
Reputation: 27732
This might also be an option:
Count the unique values per column, by id (using data.table::uniqueN)
Check for which columns the sum of those per-group counts equals the number of unique ids (using colSums)
Only keep (or drop) the wanted columns
library(data.table)
ids <- uniqueN(dt1$id)
# no variation: a column's per-id unique counts sum to `ids` only if every id has exactly one value;
# the leading TRUE keeps the id column itself
dt1[, c(TRUE, colSums(dt1[, lapply(.SD, uniqueN), by = id][, -1]) == ids), with = FALSE]
id age
1: 1 30
2: 1 30
3: 2 54
4: 2 54
5: 3 20
6: 3 20
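Since the "no variation" call above keeps every original row, wrapping it in unique() should collapse the result to one row per id, if that is what is wanted (a sketch, using the same call as above):
# same selection as above, deduplicated to one row per id
unique(dt1[, c(TRUE, colSums(dt1[, lapply(.SD, uniqueN), by = id][, -1]) == ids), with = FALSE])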
# variation: negate the same test to keep the columns that do vary within some id
dt1[, c(TRUE, !colSums(dt1[, lapply(.SD, uniqueN), by = id][, -1]) == ids), with = FALSE]
id time meas bw
1: 1 1 452 75
2: 1 2 23 75
3: 2 1 555 81
4: 2 2 33 81
5: 3 1 322 69
6: 3 2 32 70
Upvotes: 0
Reputation: 25225
How do I (efficiently) select the columns that do not vary within id (in this case, id and age)?
Maybe something like:
f <- function(DT, byChar) {
  # per group, list the names of the columns with a single unique value,
  # then intersect those name sets across all groups
  cols <- Reduce(intersect, DT[, .(.(names(.SD)[sapply(.SD, uniqueN) == 1])), byChar]$V1)
  unique(DT[, c(byChar, cols), with = FALSE])
}
f(dt1, "id")
output:
id age
1: 1 30
2: 2 54
3: 3 20
And how do I select the columns that do vary within ID (so drop age)?
Similarly,
f2 <- function(DT, byChar, k) {
  # same as f, but keep the columns with more than k unique values per group
  cols <- Reduce(intersect, DT[, .(.(names(.SD)[sapply(.SD, uniqueN) > k])), byChar]$V1)
  unique(DT[, c(byChar, cols), with = FALSE])
}
f2(dt1, "id", 1)
output:
id time meas
1: 1 1 452
2: 1 2 23
3: 2 1 555
4: 2 2 33
5: 3 1 322
6: 3 2 32
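Note that Reduce(intersect, ...) keeps only the columns that vary within every id, which is why bw (constant within ids 1 and 2) is dropped here. If "varies within at least one id" is wanted instead, swapping intersect for union should do it (f3 is just an illustrative name):
f3 <- function(DT, byChar, k) {
  # union across groups: a column qualifies if it varies within at least one group
  cols <- Reduce(union, DT[, .(.(names(.SD)[sapply(.SD, uniqueN) > k])), byChar]$V1)
  unique(DT[, c(byChar, cols), with = FALSE])
}
f3(dt1, "id", 1)  # should also keep bw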
data:
library(data.table)
dt1 <- data.table(
  id   = c(1, 1, 2, 2, 3, 3),
  time = c(1, 2, 1, 2, 1, 2),
  meas = c(452, 23, 555, 33, 322, 32),
  age  = c(30, 30, 54, 54, 20, 20),
  bw   = c(75, 75, 81, 81, 69, 70)
)
Upvotes: 1