I have a large table (several hundred thousand rows) with an ID column (a string) followed by several numeric columns containing measurements from different samples.
I need to do numeric operations such as scaling and log-transforming the numeric data, filter based on variance, then plot, and so on. The way I have done it so far is to subset the numeric portion of the data into new variables and process it accordingly.
f_na2zero(dt)
dt.num <- dt[, !c("Seq"), with = FALSE]
# +1 before scaling so the later log10 never sees a zero (avoids -Inf)
dt.scaled <- sweep(as.matrix(dt.num) + 1, 2, colSums(dt.num), "/")
dt.log <- log10(dt.scaled)
newdt <- data.table("Seq" = dt$Seq, dt.log)
dt.filtered <- newdt[nchar(Seq) == 207, ]
dt.A <- dt.filtered[, c("Seq", "Lib", "A5_1", "A5_2", "A5_3"), with = FALSE]
dt.B <- dt.filtered[, c("Seq", "Lib", "B5_1", "B5_2", "B5_3"), with = FALSE]
# exclude the character Seq column from the numeric variance computations
ind.A.highvar <- which(apply(dt.A[, !"Seq"], 1, var) > sd(as.matrix(dt.A[, !"Seq"])))
ind.B.highvar <- which(apply(dt.B[, !"Seq"], 1, var) > sd(as.matrix(dt.B[, !"Seq"])))
ind.A.highvar2 <- which(apply(dt.A[, !"Seq"], 1, var) > 2 * sd(as.matrix(dt.A[, !"Seq"])))
ind.B.highvar2 <- which(apply(dt.B[, !"Seq"], 1, var) > 2 * sd(as.matrix(dt.B[, !"Seq"])))
A.highvar <- dt.A[ind.A.highvar, !c("Seq"), with = FALSE]
A.highvar2 <- dt.A[ind.A.highvar2, !c("Seq"), with = FALSE]
B.highvar <- dt.B[ind.B.highvar, !c("Seq"), with = FALSE]
B.highvar2 <- dt.B[ind.B.highvar2, !c("Seq"), with = FALSE]
par(mfrow=c(2,2))
# plotLines takes a matrix and plots each column as a line; it returns the indices where data[1,] < data[4,]
seqs.A1 <- plotLines(t(A.highvar),"1 Sigma A",c("Lib", "A5-1", "A5-2", "A5-3"))
seqs.B1 <- plotLines(t(B.highvar),"1 Sigma B",c("Lib", "B5-1", "B5-2", "B5-3"))
seqs.A2 <- plotLines(t(A.highvar2),"2 Sigma A",c("Lib", "A5-1", "A5-2", "A5-3"))
seqs.B2 <- plotLines(t(B.highvar2),"2 Sigma B",c("Lib", "B5-1", "B5-2", "B5-3"))
The problem is that I then need the IDs to be able to make use of that data, but the row numbers change as I filter on different criteria along the way. I figure there must be a simpler way of processing the data without constantly storing new tables and variables, but I can't really make sense of the data.table indexing scheme (and before you mention it: yes, I have read the Introduction to data.table vignette).
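The core of what I want is for the IDs to travel with the rows through every filter. A toy sketch (made-up data, not my real table) of what that looks like when the ID column simply stays in the table:

```r
library(data.table)

# toy data: the ID stays in the table, so any row filter keeps it aligned
dt.toy <- data.table(Seq = c("s1", "s2", "s3"), x = c(1, 5, 2))
kept <- dt.toy[x > 1.5]   # filter rows
kept$Seq                  # IDs of the surviving rows, no bookkeeping needed
```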
So in this context, specific questions:
Is it possible to do numeric operations on some columns of the data.table (e.g. skipping column "Seq" when doing a log transform, or logically indexing based on variance) without going through intermediate variables?
How can I filter rows based on a logical function and operate on specific columns at the same time? Do I need to use the grouping operator by when I want to use .SD or .SDcols?
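For illustration, this is the kind of single-expression idiom I am after (a sketch using .SDcols on toy data, not my real table):

```r
library(data.table)

dt.toy <- data.table(Seq = c("s1", "s2", "s3"),
                     A5_1 = c(1, 2, 3), A5_2 = c(4, 5, 6))
num.cols <- setdiff(names(dt.toy), "Seq")

# log-transform every column except Seq, in place, with no intermediate copy
dt.toy[, (num.cols) := lapply(.SD, function(x) log10(x + 1)), .SDcols = num.cols]

# filter rows on a statistic computed over the numeric columns only
dt.toy[dt.toy[, apply(.SD, 1, var), .SDcols = num.cols] > 0.01]
```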
Edit: As per Roland's comment, here is a small script that recreates the type of data I am working with.
dt <- data.table(
  "Seq" = stringi::stri_rand_strings(100000, 200, "[A-Z]"),
  matrix(rnorm(n = 700000, mean = -3, sd = 1.5),
         nrow = 100000, ncol = 7,
         dimnames = list(NULL, c("A5_1", "A5_2", "A5_3",
                                 "B5_1", "B5_2", "B5_3", "Lib")))
)
Answer:
I'm sorry, but there is nothing reproducible here (where is the Seq column?), your example is not minimal, and it does not explain what exactly you'd like to achieve.
However, maybe this illustrates what I mean by long format:
library(data.table)
library(ggplot2)

dt <- melt(dt, id.vars = "Id")   # the id column is assumed here to be named Id
dt[, ind := .GRP, by = Id]
stdev <- sd(dt[, value])
ggplot(dt[, if (var(value) > (2 * stdev)) .SD, by = Id],  # filter
       aes(x = ind, y = value, color = variable)) +
  geom_line()
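An alternative to the if (...) .SD filter is to compute the per-group variance once with := and subset on it; a sketch assuming the same long-format columns (Id, variable, value) on made-up data:

```r
library(data.table)

# made-up long-format data with the same columns as above
dtl <- data.table(Id = rep(c("a", "b"), each = 4),
                  variable = rep(c("x", "y"), 4),
                  value = c(1, 2, 1, 2, 10, -8, 9, -7))
stdev <- sd(dtl[, value])

dtl[, v := var(value), by = Id]   # per-Id variance, added by reference
high <- dtl[v > 2 * stdev]        # rows of high-variance Ids; Id column retained
```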