Reputation: 7720
Say I have a data.table
prueba <- data.table(aa=1:7,
bb=c(1,2,NA,NA,3,1,1),
cc=c(1,2,NA,NA,3,1,1),
YEAR=c(1,1,1,2,2,2,2))
and I want to get the table of missings at each column from a given set, for example aa, bb and cc.
The result should be like this:
aa bb cc 1: 0 2 2 0: 7 0 0
or its transposed form or with other labels.
I've tried with
prueba[,lapply(.SD, function(x) as.list( table(
factor(is.na(x), levels=c("0","1"))))),
.SDcols=c("aa","bb", "cc")]
but I'm getting this instead:
aa bb cc 1: 7 5 5 0: 7 2 2
I think it has to do with the fact that table drops unused levels. But I've unsuccessfully tried with xtabs and all sort of hacking.
I could get something ugly with
sapply(c("aa","bb","cc"), function(x) prueba[,as.list(
table(is.na(get(x))))])
Upvotes: 2
Views: 106
Reputation: 25225
Here is another suggestion that should be general enough. First, build contingency table based on requirements. Next, convert table output into a list and rbindlist
all the results together. Finally, replace NA with 0 counts.
output <- prueba[, rbindlist(
lapply(.SD, function(x) as.list(table(is.na(x)))),
fill=TRUE,
idcol=TRUE),
.SDcols=aa:cc]
output[, lapply(.SD, function(x) replace(x, is.na(x), 0L))]
output:
.id FALSE TRUE
1: aa 7 0
2: bb 5 2
3: cc 5 2
edit: adding another general approach:
#build and flatten contingency table
tab <- prueba[, as.list(unlist(lapply(.SD, function(x) table(is.na(x))))),
.SDcols=aa:cc]
#melt, split original column names and then pivot
dcast(
melt(tab, measure.vars=names(tab))[,
c("V1","Factor") := tstrsplit(variable, split="\\.")],
Factor ~ V1,
function(x) x[1L],
fill=0L)
output:
Factor aa bb cc
1: FALSE 7 5 5
2: TRUE 0 2 2
edit: add timings
set.seed(0L)
sz <- 1e6
nc <- 10
DT <- as.data.table(matrix(sample(c(NA_integer_, 1L:10L), sz*nc, TRUE), ncol=nc))
setnames(DT, paste0("C", 1L:nc))
cols <- names(DT)
mtd1 <- function() {
DT[, table(is.na(.SD), names(.SD)[col(.SD)]), .SDcols=cols]
}
mtd2 <- function() {
DT[, table(is.na(.SD), rep(names(.SD), each=.N)), .SDcols=cols]
}
mtd3 <- function() {
melt(DT[, ..cols], measure.vars=cols)[, table(is.na(value), variable)]
}
mtd4 <- function() {
tab <- DT[, as.list(unlist(lapply(.SD, function(x) table(is.na(x))))),
.SDcols=cols]
dcast(melt(tab, measure.vars=names(tab))[, c("V1","Factor") := tstrsplit(variable, split="\\.")],
Factor ~ V1, function(x) x[1L], fill=0L)
}
mtd5 <- function() {
output <- DT[, rbindlist(lapply(.SD, function(x) as.list(table(is.na(x)))), fill=TRUE, idcol=TRUE),
.SDcols=cols]
output[, lapply(.SD, function(x) replace(x, is.na(x), 0L))]
}
library(microbenchmark)
microbenchmark(mtd1(), mtd2(), mtd3(), mtd4(), mtd5(), times=3L)
timings:
Unit: seconds
expr min lq mean median uq max neval cld
mtd1() 5.044369 5.049252 5.086534 5.054135 5.107617 5.161100 3 b
mtd2() 5.106796 5.110014 5.474269 5.113232 5.658005 6.202778 3 b
mtd3() 2.395127 2.461463 2.509938 2.527799 2.567344 2.606888 3 a
mtd4() 2.138672 2.142300 2.145895 2.145927 2.149506 2.153084 3 a
mtd5() 2.113367 2.175346 2.228162 2.237325 2.285560 2.333794 3 a
Upvotes: 2
Reputation: 66819
Maybe use table
:
prueba[, table(is.na(.SD), names(.SD)[col(.SD)]), .SDcols=aa:cc]
aa bb cc
FALSE 7 5 5
TRUE 0 2 2
This is essentially treating it like a matrix.
Some alternatives:
prueba[, table(is.na(.SD), rep(names(.SD), each=.N)), .SDcols=aa:cc]
melt(prueba[, aa:cc])[, table(is.na(value), variable)]
Upvotes: 4
Reputation: 7720
OK, I've found a solution, a little bit convoluted:
prueba[, lapply(.SD, function(x) as.list( table(factor(
is.na(x), levels=c(F,T)))) ), .SDcols=c("aa","bb", "cc")]
There should be an easier way.
Upvotes: 1
Reputation: 81693
Here's an approach with base R:
rbind(tmp <- colSums(is.na(prueba[ , -"YEAR"])), nrow(prueba) - tmp)
# aa bb cc
# [1,] 0 2 2
# [2,] 7 5 5
Upvotes: 3