Reputation: 4928
I am trying to check which columns of a data.table object have at least a certain number of non-NA values (for example, 5), and then discard the columns that do not meet that criterion.
Consider the following data:
require(data.table)
DT = data.table(x=rep(c("a","b","c"),each=3), y=c(1,NA,6), v=c(1,2,NA,NA,NA,NA,NA,8,9))
DT
   x  y  v
1: a  1  1
2: a NA  2
3: a  6 NA
4: b  1 NA
5: b NA NA
6: b  6 NA
7: c  1 NA
8: c NA  8
9: c  6  9
In the above example, column v has only 4 non-NA values, which is fewer than 5, so I'd like to discard that column:
DT[,c(3) := NULL]
DT
   x  y
1: a  1
2: a NA
3: a  6
4: b  1
5: b NA
6: b  6
7: c  1
8: c NA
9: c  6
I need help understanding how to combine the .N symbol* and 'if' statements with data.table to check an object with many columns.
My question is: how can I do the count programmatically across all columns, and discard only the ones that do not pass the criterion? Thanks.
*I am not sure if .N is needed, but from previous research I understood that this symbol is used for counting inside data.table objects.
Upvotes: 1
Views: 1792
Reputation: 49448
Here is one way of doing it:
DT[, which(lapply(DT, function(x) sum(!is.na(x))) < 5) := NULL]
Since a data.table is a list of columns, lapply loops over the individual columns and applies the required function. After that, which returns the positions of the columns whose count is below the threshold, and := NULL removes them.
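The same idea can be spelled out step by step, which may be easier to adapt. The following is only a sketch, not a replacement for the one-liner: the names threshold, non_na and drop_cols are made up for illustration, vapply is swapped in for lapply so the counts come back as a plain named integer vector, and y is written out with rep() so the example does not rely on recycling. Note that .N is not needed here: it counts rows (per group), NA or not, so it cannot tell the columns apart.

```r
library(data.table)

DT <- data.table(x = rep(c("a", "b", "c"), each = 3),
                 y = rep(c(1, NA, 6), times = 3),
                 v = c(1, 2, NA, NA, NA, NA, NA, 8, 9))

# Hypothetical name: minimum number of non-NA values a column must have
threshold <- 5

# Count the non-NA values in each column (named integer vector)
non_na <- vapply(DT, function(col) sum(!is.na(col)), integer(1))

# Names of the columns that fall below the threshold
drop_cols <- names(non_na)[non_na < threshold]

# Remove them by reference; skip the call when there is nothing to drop
if (length(drop_cols) > 0) DT[, (drop_cols) := NULL]
```

With the example data, non_na is x = 9, y = 6, v = 4, so only v is dropped. Dropping by name rather than by position keeps the intent readable, and the length check is a cautious guard for the case where every column passes.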
Upvotes: 3