I have a large table (several hundred thousand rows) with an ID column (a string) followed by several numeric columns containing measurements from different samples.
I need to do numeric operations such as scaling and log-transforming the numeric data, filter based on variance, then plot, and so on. The way I have done it so far is to subset the numeric portion of the data into new variables and process it accordingly.
f_na2zero(dt)
dt.num <- dt[, !c("Seq"), with = FALSE]
# +1 before scaling so the later log10 never sees a zero (avoids -Inf)
dt.scaled <- sweep(as.matrix(dt.num) + 1, 2, colSums(dt.num), "/")
dt.log <- log10(dt.scaled)
newdt <- data.table("Seq" = dt$Seq, dt.log)
dt.filtered <- newdt[nchar(Seq) == 207, ]
dt.A <- dt.filtered[, c("Seq", "Lib", "A5_1", "A5_2", "A5_3"), with = FALSE]
dt.B <- dt.filtered[, c("Seq", "Lib", "B5_1", "B5_2", "B5_3"), with = FALSE]
# exclude the character Seq column from the numeric variance computations
ind.A.highvar <- which(apply(dt.A[, !"Seq"], 1, var) > sd(as.matrix(dt.A[, !"Seq"])))
ind.B.highvar <- which(apply(dt.B[, !"Seq"], 1, var) > sd(as.matrix(dt.B[, !"Seq"])))
ind.A.highvar2 <- which(apply(dt.A[, !"Seq"], 1, var) > 2 * sd(as.matrix(dt.A[, !"Seq"])))
ind.B.highvar2 <- which(apply(dt.B[, !"Seq"], 1, var) > 2 * sd(as.matrix(dt.B[, !"Seq"])))
A.highvar <- dt.A[ind.A.highvar, !c("Seq"), with = FALSE]
A.highvar2 <- dt.A[ind.A.highvar2, !c("Seq"), with = FALSE]
B.highvar <- dt.B[ind.B.highvar, !c("Seq"), with = FALSE]
B.highvar2 <- dt.B[ind.B.highvar2, !c("Seq"), with = FALSE]
par(mfrow=c(2,2))
# plotLines takes a matrix and plots each column as a line; it returns the indices where data[1,] < data[4,]
seqs.A1 <- plotLines(t(A.highvar),"1 Sigma A",c("Lib", "A5-1", "A5-2", "A5-3"))
seqs.B1 <- plotLines(t(B.highvar),"1 Sigma B",c("Lib", "B5-1", "B5-2", "B5-3"))
seqs.A2 <- plotLines(t(A.highvar2),"2 Sigma A",c("Lib", "A5-1", "A5-2", "A5-3"))
seqs.B2 <- plotLines(t(B.highvar2),"2 Sigma B",c("Lib", "B5-1", "B5-2", "B5-3"))
The problem is that I then need the IDs to be able to make use of that data, but the row numbers change as I filter on different criteria along the way. I figure there must be a simpler way of processing the data without constantly storing new tables and variables, but I can't really make sense of the data.table indexing scheme (and before you mention it: yes, I have read the Introduction to data.table vignette).
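The core of what I want is for the IDs to travel with the rows through every filter. A toy sketch (made-up data, not my real table) of what that looks like when the ID column simply stays in the table:

```r
library(data.table)

# toy data: the ID stays in the table, so any row filter keeps it aligned
dt.toy <- data.table(Seq = c("s1", "s2", "s3"), x = c(1, 5, 2))
kept <- dt.toy[x > 1.5]   # filter rows
kept$Seq                  # IDs of the surviving rows, no bookkeeping needed
```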
So in this context, specific questions:
Is it possible to do numeric operations on some columns of the data.table (e.g. skipping column "Seq" when doing a log transform, or logically indexing based on variance) without going through intermediate variables?
How can I filter rows based on a logical function and operate on specific columns at the same time? Do I need to use the grouping operator by when I want to use .SD or .SDcols?
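For illustration, this is the kind of single-expression idiom I am after (a sketch using .SDcols on toy data, not my real table):

```r
library(data.table)

dt.toy <- data.table(Seq = c("s1", "s2", "s3"),
                     A5_1 = c(1, 2, 3), A5_2 = c(4, 5, 6))
num.cols <- setdiff(names(dt.toy), "Seq")

# log-transform every column except Seq, in place, with no intermediate copy
dt.toy[, (num.cols) := lapply(.SD, function(x) log10(x + 1)), .SDcols = num.cols]

# filter rows on a statistic computed over the numeric columns only
dt.toy[dt.toy[, apply(.SD, 1, var), .SDcols = num.cols] > 0.01]
```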
Edit: As per Roland's comment, here is a small script that recreates the type of data I am working with.
dt <- data.table(
  "Seq" = stringi::stri_rand_strings(100000, 200, "[A-Z]"),
  matrix(rnorm(n = 700000, mean = -3, sd = 1.5),
         nrow = 100000, ncol = 7,
         dimnames = list(NULL, c("A5_1", "A5_2", "A5_3",
                                 "B5_1", "B5_2", "B5_3", "Lib")))
)
Answer:
I'm sorry, but there is nothing reproducible here (where is the Seq column?), your example is not minimal, and it does not explain what exactly you'd like to achieve.
However, maybe this illustrates what I mean by long format:
library(data.table)
library(ggplot2)

dt <- melt(dt, id.vars = "Id")   # the id column is assumed here to be named Id
dt[, ind := .GRP, by = Id]
stdev <- sd(dt[, value])
ggplot(dt[, if (var(value) > (2 * stdev)) .SD, by = Id],  # filter
       aes(x = ind, y = value, color = variable)) +
  geom_line()
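An alternative to the if (...) .SD filter is to compute the per-group variance once with := and subset on it; a sketch assuming the same long-format columns (Id, variable, value) on made-up data:

```r
library(data.table)

# made-up long-format data with the same columns as above
dtl <- data.table(Id = rep(c("a", "b"), each = 4),
                  variable = rep(c("x", "y"), 4),
                  value = c(1, 2, 1, 2, 10, -8, 9, -7))
stdev <- sd(dtl[, value])

dtl[, v := var(value), by = Id]   # per-Id variance, added by reference
high <- dtl[v > 2 * stdev]        # rows of high-variance Ids; Id column retained
```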