Reputation: 8626
I have a table of 55,000 rows that looks like this (left table in the picture):
(the code to generate sample data is below)
Now I need to convert every row of this table into 6 rows, each containing one letter of "hexamer" (right table in the picture), with some calculations:
# input for the function is one row of source table, output is 6 rows
splithexamer <- function(x) {
  dir <- x$dir   # strand direction: +1 or -1
  pos <- x$pos   # hexamer position
  out <- x[0, ]  # template of output
  hexamer <- as.character(x$hexamer)
  for (i in 1:nchar(hexamer)) {
    letter <- substr(hexamer, i, i)
    if (dir == 1) {
      newpos <- pos + i - 1
    } else {
      newpos <- pos + 6 - i
    }
    y <- x
    y$pos <- newpos
    y$letter <- letter
    out <- rbind(out, y)
  }
  return(out)
}
# Sample data generation:
set.seed(123)
size <- 55000
letters <- c("G", "A", "T", "C")
df <- data.frame(
  HSid = paste0("Hs.", 1:size),
  hexamer = replicate(n = size, paste0(sample(letters, 6, replace = TRUE), collapse = "")),
  chr = sample(c(1:23, "X", "Y"), size, replace = TRUE),
  pos = sample(1:99999, size, replace = TRUE),
  dir = sample(c(1, -1), size, replace = TRUE)
)
Now I would like some advice on the most efficient way to apply my function to every row. So far I have tried the following:
# Variant 1: for() with rbind
tmp <- data.frame()
for (i in 1:nrow(df)) {
  tmp <- rbind(tmp, splithexamer(df[i, ]))
}
# Variant 2: for() with direct writing to file
for (i in 1:nrow(df)) {
  write.table(splithexamer(df[i, ]), file = "d:/test.txt", append = TRUE, quote = FALSE, col.names = FALSE)
}
# Variant 3: ddply (from the plyr package)
library(plyr)
tmp <- ddply(df, .(HSid), .fun = splithexamer)
# Variant 4: apply - I don't know correct syntax
tmp <- apply(X = df, 1, FUN = splithexamer)  # this causes an error
All of the above are extremely slow. I am wondering if there is a better way to solve this task.
Upvotes: 3
Views: 221
Reputation: 118799
Solution using data.table:
library(data.table)
df$hexamer <- as.character(df$hexamer)
dt <- data.table(df)
dt[, id := seq_len(nrow(df))]  # one unique id per source row
setkey(dt, "id")
dt.out <- dt[, {
  mod.pos <- pos:(pos + 5)
  if (dir == -1) mod.pos <- rev(mod.pos)
  list(split = unlist(strsplit(hexamer, "")),
       mod.pos = mod.pos)
}, by = id][dt][, id := NULL]
dt.out
#         split mod.pos     HSid hexamer chr   pos dir
#      1:     G   95982     Hs.1  GCTCCA   5 95982   1
#      2:     C   95983     Hs.1  GCTCCA   5 95982   1
#      3:     T   95984     Hs.1  GCTCCA   5 95982   1
#      4:     C   95985     Hs.1  GCTCCA   5 95982   1
#      5:     C   95986     Hs.1  GCTCCA   5 95982   1
#     ---
# 329996:     A   59437 Hs.55000  AATCTG   7 59436   1
# 329997:     T   59438 Hs.55000  AATCTG   7 59436   1
# 329998:     C   59439 Hs.55000  AATCTG   7 59436   1
# 329999:     T   59440 Hs.55000  AATCTG   7 59436   1
# 330000:     G   59441 Hs.55000  AATCTG   7 59436   1
Explanation of the main line:
by=id groups by id, and since the ids are all unique, it processes every row, one at a time.
The {} block sets mod.pos to pos:(pos+6-1) and, if dir == -1, reverses it.
The list argument creates the column split by breaking your hexamer into its 6 nucleotides using strsplit, and also returns mod.pos, which we've already calculated in the step before.
The result is a data.table with columns id, split and mod.pos.
[dt] is a typical usage of data.table's X[Y] syntax, which performs a join on the data.tables based on the key column (id, here). Since there are 6 rows for every id, all the other columns in dt are duplicated during this join.
I'd suggest you take a look at the data.table FAQ first and then its documentation (intro). These links can be obtained by installing the package, loading it, and then typing ?data.table. I also suggest you work through the many examples in there one by one with a test data.table to get a practical understanding of the features of data.table.
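To see the grouping step and the join separately, here is a minimal sketch on a made-up two-row table. The names toy, expanded and joined are hypothetical and not part of the solution above, and the join is written with an explicit on= argument instead of relying on the key, which is an assumption of this sketch rather than what the main line does.

```r
library(data.table)

# Hypothetical two-row table: one forward-strand and one reverse-strand hexamer.
toy <- data.table(
  id      = 1:2,
  hexamer = c("GATCGA", "TTGACC"),
  pos     = c(100L, 200L),
  dir     = c(1, -1)
)

# Step 1: each id forms its own group; every group returns six letters
# and six positions (reversed when dir == -1).
expanded <- toy[, {
  mod.pos <- pos:(pos + 5L)
  if (dir == -1) mod.pos <- rev(mod.pos)
  list(split = unlist(strsplit(hexamer, "")), mod.pos = mod.pos)
}, by = id]

# Step 2: join back on id so the original columns are repeated
# once for each of the six expanded rows.
joined <- expanded[toy, on = "id"]
print(joined)
```

Printing expanded before the join shows only id, split and mod.pos, which makes it easier to see what the join adds.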
Hope this helps.
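As an aside, the same expansion can also be sketched in base R without any per-row loop, by repeating each row six times and computing all offsets at once. This is a different technique from the data.table solution above, not a restatement of it. The function name expand_hexamers and the demo data are hypothetical, and the sketch assumes every hexamer is exactly 6 characters long. Like splithexamer in the question, it overwrites pos and adds a letter column.

```r
# Vectorized sketch: no per-row function calls and no repeated rbind.
# Assumes every value in df$hexamer has exactly k characters.
expand_hexamers <- function(df, k = 6L) {
  n      <- nrow(df)
  idx    <- rep(seq_len(n), each = k)    # source row for each output row
  offset <- rep(0:(k - 1L), times = n)   # 0..5 within each hexamer

  out <- df[idx, , drop = FALSE]
  out$letter <- unlist(strsplit(as.character(df$hexamer), ""))

  # Forward strand counts up from pos; reverse strand counts down from pos + 5.
  out$pos <- ifelse(df$dir[idx] == 1,
                    df$pos[idx] + offset,
                    df$pos[idx] + (k - 1L) - offset)
  rownames(out) <- NULL
  out
}

# Hypothetical demo data with one row per strand direction.
demo <- data.frame(
  HSid    = c("Hs.1", "Hs.2"),
  hexamer = c("GATCGA", "TTGACC"),
  chr     = c("1", "X"),
  pos     = c(100, 200),
  dir     = c(1, -1),
  stringsAsFactors = FALSE
)
res <- expand_hexamers(demo)
print(res)
```

On the full 55,000-row table this would be called as expand_hexamers(df). I have not timed it against the data.table version.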
Upvotes: 3