Reputation: 253
I have this data
CHOM POS REF ALT
1 121 A AA,AT
2 254 GCGC GCGCG,AGCG
3 214 C T
I need to split the ALT column so that the result looks like this:
CHOM POS REF ALT ALT1 ALT2 ...
1 121 A AA AT 0
2 254 GCGC GCGCG AGCG 0
3 214 C T 0 0
I tried this, but it gives an error:
alt=x$ALT
strsplit(alt, ",")
Note: There are many different ALT and REF values. Splitting on the comma gives at most 4 columns. If a row has fewer values (for example, no comma at all), just put 0 or NA in the remaining columns.
Upvotes: 2
Views: 128
Reputation: 193517
I would write a function like the following to split the column:
splitFun <- function(inVec, sep = ",", newName = "ALT", fill = NA) {
  if (!is.character(inVec)) inVec <- as.character(inVec)
  X <- strsplit(inVec, sep, fixed = TRUE)
  cols <- vapply(X, length, 1L)  ## number of pieces in each row
  ## Pre-filled matrix with columns named ALT, ALT1, ALT2, ...
  M <- matrix(
    fill, nrow = length(inVec), ncol = max(cols),
    dimnames = list(NULL, make.unique(rep(newName, max(cols)), sep = "")))
  ## Put every piece into its (row, position-within-row) cell
  M[cbind(rep(sequence(length(X)), cols), sequence(cols))] <-
    unlist(X, use.names = FALSE)
  M
}
Usage is simple:
splitFun(mydf$ALT) ## Modify default arguments accordingly
# ALT ALT1 ALT2
# [1,] "AA" "AT" NA
# [2,] "GCGCG" "AGCG" NA
# [3,] "GCGCG" "AT" "AA"
cbind(mydf, splitFun(mydf$ALT))
# CHOM POS REF ALT ALT ALT1 ALT2
# 1 1 121 A AA,AT AA AT <NA>
# 2 2 254 GCGC GCGCG,AGCG GCGCG AGCG <NA>
# 3 1 123 GCGC GCGCG,AT,AA GCGCG AT AA
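The key step inside splitFun is the assignment through a two-column index matrix, where each row of the index is a (row, column) pair. The sketch below shows only that step on a toy vector; the names parts, idx and the "0" filler are made up for illustration and are not part of the function above.

```r
# Toy input: three comma-separated strings with unequal numbers of pieces
parts <- strsplit(c("AA,AT", "T", "G,C,A"), ",", fixed = TRUE)
cols  <- lengths(parts)                 # pieces per row: 2 1 3

# Row index repeats each row number once per piece;
# column index restarts at 1 for every row
idx <- cbind(row = rep(seq_along(parts), cols),
             col = sequence(cols))

# Matrix pre-filled with "0" (assumed filler; use NA if you prefer)
M <- matrix("0", nrow = length(parts), ncol = max(cols))
M[idx] <- unlist(parts, use.names = FALSE)
M
```

Because the filler is the string "0", the result stays a character matrix; rows with fewer pieces simply keep the filler in their trailing cells.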
It should also be reasonably fast. Here's a timing comparison with the "splitstackshape" approach (which also handles unbalanced situations).
system.time(splitstackshape:::read.concat(
bigDf$ALT, sep=",", col.prefix="ALT"))
# user system elapsed
# 1.197 0.000 1.202
system.time(splitFun(bigDf$ALT))
# user system elapsed
# 0.069 0.000 0.068
For the above, the sample data used was:
mydf <- data.frame(CHOM = c(1, 2, 1), POS = c(121, 254, 123),
REF = c("A", "GCGC", "GCGC"),
ALT = c("AA,AT", "GCGCG,AGCG", "GCGCG,AT,AA"))
mydf
# CHOM POS REF ALT
# 1 1 121 A AA,AT
# 2 2 254 GCGC GCGCG,AGCG
# 3 1 123 GCGC GCGCG,AT,AA
bigDf <- do.call(rbind, replicate(10000, mydf, simplify = FALSE))
You can try concat.split from my "splitstackshape" package:
library(splitstackshape)
concat.split(mydf, "ALT", ",") ## Add `drop = TRUE` to drop the original column
# CHOM POS REF ALT ALT_1 ALT_2
# 1 1 121 A AA,AT AA AT
# 2 2 254 GCGC GCGCG,AGCG GCGCG AGCG
There is also colsplit from the "reshape2" package:
library(reshape2)
colsplit(as.character(mydf$ALT), ",", c("ALT", "ALT1"))
# ALT ALT1
# 1 AA AT
# 2 GCGCG AGCG
You can use cbind to add the output to your original dataset.
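As a rough sketch of that cbind step in base R only (no packages), the following replaces the original ALT column with the split pieces. It assumes every row has the same number of comma-separated values, since do.call(rbind, ...) would otherwise recycle; dat, pieces and out are illustrative names.

```r
# Two-row sample in the shape of the question's data
dat <- data.frame(CHOM = c(1, 2), POS = c(121, 254),
                  REF = c("A", "GCGC"),
                  ALT = c("AA,AT", "GCGCG,AGCG"),
                  stringsAsFactors = FALSE)

# One row per record, one column per piece (balanced input assumed)
pieces <- do.call(rbind, strsplit(dat$ALT, ",", fixed = TRUE))
colnames(pieces) <- c("ALT", "ALT1")

# Drop the original ALT column, then bind the split columns on
out <- cbind(dat[names(dat) != "ALT"], pieces)
out
```

Keep the original column instead by using cbind(dat, pieces), but then two columns will both be called ALT, as in the splitFun output above.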
Upvotes: 4
Reputation: 263301
> dat[ c("ALT", "ALT1")] <- read.table(text=as.character(dat$ALT), sep=",")
> dat
CHOM POS REF ALT ALT1
1 1 121 A AA AT
2 2 254 GCGC GCGCG AGCG
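The call above expects the same number of values in every row. A possible variant for unbalanced rows, offered as a sketch rather than a tested drop-in, passes fill = TRUE together with explicit column names. The four names in alt_names come from the question's stated maximum of 4; colClasses = "character" is there so that a lone "T" is not read as the logical TRUE, and na.strings = "" is intended to turn the padded cells into NA.

```r
# Sample with an unbalanced third row, as in the question
dat <- data.frame(CHOM = c(1, 2, 3), POS = c(121, 254, 214),
                  REF = c("A", "GCGC", "C"),
                  ALT = c("AA,AT", "GCGCG,AGCG", "T"),
                  stringsAsFactors = FALSE)

# Assumed maximum of four values per row
alt_names <- c("ALT", "ALT1", "ALT2", "ALT3")

split_alt <- read.table(text = dat$ALT, sep = ",", header = FALSE,
                        fill = TRUE,                 # pad short rows
                        col.names = alt_names,       # fixes the column count
                        colClasses = "character",    # keep "T" as text
                        na.strings = "")             # padded cells as NA

# Overwrite ALT and add the extra columns in one assignment
dat[alt_names] <- split_alt
dat
```

To get 0 instead of NA, a follow-up such as dat[alt_names][is.na(dat[alt_names])] <- "0" could be applied to the new columns.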
Upvotes: 1
Reputation: 61154
Assuming your data is in dat:
> dat2 <- data.frame(dat[, -4], do.call(rbind, strsplit(as.character(dat$ALT), ",")))
> colnames(dat2)[4:5] <- c("ALT", "ALT1")
> dat2
CHOM POS REF ALT ALT1
1 1 121 A AA AT
2 2 254 GCGC GCGCG AGCG
Upvotes: 2