Reputation: 376
I have a large data.table
of genotypes (260,000 rows by 1000 columns). The rows are markers and the columns are the subjects. The data looks like this:
ID1 ID2 ID3 ID4
M1: CC CC TC CC
M2: GG GG GG GG
M3: TT TT TT TT
M4: TG TG TG TG
M5: TT TT TT TT
M6: TT TT TT TT
I need to split each genotype so that I have each allele in its own column like this:
V1 V2 V3 V4 V5 V6 V7 V8
M1: C C C C T C C C
M2: G G G G G G G G
M3: T T T T T T T T
M4: T G T G T G T G
M5: T T T T T T T T
M6: T T T T T T T T
I have come up with two solutions, both of which work on a subset of the data, but breaks down on the entire data set due to memory issues or some internal error of data.table
that I dont understand.
I used strsplit
on each column and stored it to a list, then used do.call
to merge them all. I also parallelized it using the foreach
function
ids <- colnames(DT)
gene.split <- function(i) {
as.data.table(do.call(rbind,strsplit(as.vector(eval(parse(text=paste("DT$",ids[i])))), split = "")))
}
all.gene <- foreach(i=1:length(ids)) %dopar% gene.split(i)
do.call(cbind,all.gene)
On 4 cores this breaks down due to memory issues.
The second solution is based on a similar problem which uses the set
function:
out_names <- paste("V", 1:(2*ncol(DT)), sep="_")
invar1 <- names(DT)
for (i in seq_along(invar1)) {
set(DT, i=NULL, j=out_names[2*i-1], value=do.call(rbind, strsplit(DT[[invar1[i]]], split = ""))[,1])
set(DT, i=NULL, j=out_names[2*i], value=do.call(rbind, strsplit(DT[[invar1[i]]], split = ""))[,2])
}
which works on a few columns but then I get the following error if I try using the entire dataset:
Error in set(DT, i = NULL, j = out_names[2 * i - 1], value = do.call(rbind, : Internal logical error. DT passed to assign has not been allocated enough column slots. l=163, tl=163, adding 1
Am I going about this the wrong way?
Upvotes: 0
Views: 941
Reputation: 115485
Here is an approach using data.table::set
and substr
(not strsplit)
Using @jbaums example data l
# coerce to `data.table` without a copy
setDT(l)
# over allocate columns so that `data.table` can assign by reference
# this will stop the error you were seeing
alloc.col(l,3000)
out_names <- paste("V", 1:(2*ncol(l)), sep="_")
invar1 <- names(l)
for (i in seq_along(invar1)) {
set(l, i=NULL, j=out_names[2*i-1], value=substr(l[[invar1[i]]],1,1))
set(l, i=NULL, j=out_names[2*i], value=substr(l[[invar1[i]]],2,2))
}
The final step took 37 seconds on my Windows 7 i7 2600 machine with 8GB ram
In your example you run strsplit twice (and use do.call(rbind....)
) --> not efficient.
Some benchmarking of possible approaches to the splitting....
microbenchmark(substr(l[[invar1[1L]]],2,2), sapply(strsplit(l[[invar1[1L]]],''),`[`,2L),do.call(rbind, strsplit(l[[invar1[i]]], split = ""))[,2], times=5)
Unit: milliseconds
expr min lq median uq max neval
substr(l[[invar1[1L]]], 2, 2) 14.10669 14.35571 14.57485 15.78283 193.9125 5
sapply(strsplit(l[[invar1[1L]]], ""), `[`, 2L) 345.92969 1420.03907 1944.33873 3864.82876 5371.6130 5
do.call(rbind, strsplit(l[[invar1[i]]], split = ""))[, 2] 3318.70878 4131.38551 4155.06126 5269.92745 8414.4948 5
Upvotes: 3
Reputation: 27408
Here's a relatively fast approach - took ~80 sec (after dummy data creation) (Win 8.1 x64; i4770) but chewed up ~13 GB of RAM.
# Creating initial data
pairs <- c(outer(c('C', 'T', 'G', 'A'), c('C', 'T', 'G', 'A'), 'paste0'))
l <- replicate(1000, sample(pairs, 260000, replace=TRUE), simplify=FALSE)
system.time({
v <- do.call(paste0, l)
rm(l); gc()
out <- do.call(rbind, strsplit(v, ''))
rm(v); gc()
})
# user system elapsed
# 79.07 1.24 80.33
str(out)
# chr [1:260000, 1:2000] "A" "C" "C" "C" ...
Upvotes: 2
Reputation: 42689
Here's a way to do this for a data frame x
:
do.call(cbind,
lapply(x,
function(i) do.call(rbind, strsplit(as.character(i), split=''))
)
)
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
[1,] "C" "C" "C" "C" "T" "C" "C" "C"
[2,] "G" "G" "G" "G" "G" "G" "G" "G"
[3,] "T" "T" "T" "T" "T" "T" "T" "T"
[4,] "T" "G" "T" "G" "T" "G" "T" "G"
[5,] "T" "T" "T" "T" "T" "T" "T" "T"
[6,] "T" "T" "T" "T" "T" "T" "T" "T"
Each column is split into characters, and then r-bound together. This gives a list of columns, which are then passed to cbind
.
Upvotes: 0
Reputation: 1677
## make a small data.table for testing
dd <- data.table(ID1=c("CC","TG"),ID2=c("CC","TG"), ID3=c("TC","TG"))
dd
## ID1 ID2 ID3
## 1: CC CC TC
## 2: TG TG TG
## the first base
apply(dd,1:2,function(e) strsplit(e,split='')[[1]][1])
## ID1 ID2 ID3
## [1,] "C" "C" "T"
## [2,] "T" "T" "T"
## the second base
apply(dd,1:2,function(e) strsplit(e,split='')[[1]][2])
## ID1 ID2 ID3
## [1,] "C" "C" "C"
## [2,] "G" "G" "G"
## These results are in matrix, if you need data.table use as.data.table to convert them back.
Upvotes: -1