user3478697

Reputation: 253

How can I split a comma-separated column into new columns in R?

I have this data

CHOM POS REF ALT
1    121  A   AA,AT
2    254  GCGC  GCGCG,AGCG
3    214  C    T

I need to split the ALT column so that the result looks like this:

CHOM POS REF   ALT    ALT1  ALT2 ...
1    121 A     AA     AT    0
2    254 GCGC  GCGCG  AGCG  0
3    214 C     T      0     0

I tried this, but I get an error:

alt=x$ALT
strsplit(alt, ",")

Note: There are many different ALT and REF values, and the maximum number of comma-separated values in ALT is 4. Where a row has fewer values, just fill the extra columns with 0 or NA.

Upvotes: 2

Views: 128

Answers (3)

A5C1D2H2I1M1N2O1R2T1

Reputation: 193517

New Answer

I would write a function like the following to split the column:

splitFun <- function(inVec, sep = ",", newName = "ALT", fill = NA) {
  if (!is.character(inVec)) inVec <- as.character(inVec)
  X <- strsplit(inVec, sep, fixed = TRUE)
  cols <- vapply(X, length, 1L)
  M <- matrix(
    fill, nrow = length(inVec), ncol = max(cols),
    dimnames = list(NULL, make.unique(rep(newName, max(cols)), sep="")))
  M[cbind(rep(sequence(length(X)), cols), sequence(cols))] <- 
    unlist(X, use.names=FALSE)
  M
}

Usage is simple:

splitFun(mydf$ALT)  ## Modify default arguments accordingly
#      ALT     ALT1   ALT2
# [1,] "AA"    "AT"   NA  
# [2,] "GCGCG" "AGCG" NA  
# [3,] "GCGCG" "AT"   "AA"
cbind(mydf, splitFun(mydf$ALT))
#   CHOM POS  REF         ALT   ALT ALT1 ALT2
# 1    1 121    A       AA,AT    AA   AT <NA>
# 2    2 254 GCGC  GCGCG,AGCG GCGCG AGCG <NA>
# 3    1 123 GCGC GCGCG,AT,AA GCGCG   AT   AA

This should be pretty efficient. Here's a timing comparison with the "splitstackshape" approach (which also handles unbalanced rows).

system.time(splitstackshape:::read.concat(
  bigDf$ALT, sep=",", col.prefix="ALT"))
#    user  system elapsed 
#   1.197   0.000   1.202 
system.time(splitFun(bigDf$ALT))
#    user  system elapsed 
#   0.069   0.000   0.068 

For the above, the sample data used was:

mydf <- data.frame(CHOM = c(1, 2, 1), POS = c(121, 254, 123), 
                   REF = c("A", "GCGC", "GCGC"), 
                   ALT = c("AA,AT", "GCGCG,AGCG", "GCGCG,AT,AA"))
mydf
#   CHOM POS  REF         ALT
# 1    1 121    A       AA,AT
# 2    2 254 GCGC  GCGCG,AGCG
# 3    1 123 GCGC GCGCG,AT,AA

bigDf <- do.call(rbind, replicate(10000, mydf, simplify = FALSE))
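If the result should always have a fixed number of columns, as the question's note suggests (at most four values, with 0 where a value is missing), a base R variation might look like the sketch below. The sample data, the `max_alt` bound, and the variable names are assumptions for illustration, not part of the function above.

```r
# Hypothetical sample data mirroring the layout in the question
mydf <- data.frame(CHOM = c(1, 2, 3), POS = c(121, 254, 214),
                   REF = c("A", "GCGC", "C"),
                   ALT = c("AA,AT", "GCGCG,AGCG", "T"),
                   stringsAsFactors = FALSE)

max_alt <- 4L  # assumed upper bound, taken from the question's note

# Split each ALT string on literal commas
parts <- strsplit(as.character(mydf$ALT), ",", fixed = TRUE)

# Pad every piece to exactly max_alt entries; `length<-` pads with NA
# (and would silently truncate anything longer than max_alt)
padded <- t(vapply(parts, function(p) {
  length(p) <- max_alt
  p
}, character(max_alt)))
colnames(padded) <- c("ALT", paste0("ALT", seq_len(max_alt - 1L)))

# Optional: use "0" instead of NA as the filler
padded[is.na(padded)] <- "0"

# Replace the original ALT column with the split columns
result <- cbind(mydf[c("CHOM", "POS", "REF")], padded,
                stringsAsFactors = FALSE)
result
```

Skip the `padded[is.na(padded)] <- "0"` line to keep NA as the filler, which is usually easier to work with later than a character "0".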

Old Answer

You can try concat.split from my "splitstackshape" package:

library(splitstackshape)
concat.split(mydf, "ALT", ",")  ## Add `drop = TRUE` to drop the original column
#   CHOM POS  REF        ALT ALT_1 ALT_2
# 1    1 121    A      AA,AT    AA    AT
# 2    2 254 GCGC GCGCG,AGCG GCGCG  AGCG

There is also colsplit from the "reshape2" package:

library(reshape2)
colsplit(as.character(mydf$ALT), ",", c("ALT", "ALT1"))
#     ALT ALT1
# 1    AA   AT
# 2 GCGCG AGCG

You can use cbind to add the output to your original dataset.

Upvotes: 4

IRTFM

Reputation: 263301

> dat[ c("ALT", "ALT1")] <- read.table(text=as.character(dat$ALT), sep=",")
> dat
  CHOM POS  REF   ALT ALT1
1    1 121    A    AA   AT
2    2 254 GCGC GCGCG AGCG
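
The call above relies on every row having the same number of values. For rows of unequal length, one possible sketch (assuming at most four values per row, as stated in the question) is to pass `fill = TRUE` together with explicit `col.names`. The sample data and the names `alt_names` and `alt_cols` are made up for illustration; `colClasses = "character"` is there so that a lone "T" is not read as a logical TRUE.

```r
# Hypothetical sample data with an unequal number of ALT values per row
dat <- data.frame(CHOM = c(1, 2, 3), POS = c(121, 254, 214),
                  REF = c("A", "GCGC", "C"),
                  ALT = c("AA,AT", "GCGCG,AGCG", "T"),
                  stringsAsFactors = FALSE)

alt_names <- c("ALT", "ALT1", "ALT2", "ALT3")  # assumed maximum of 4 values

# fill = TRUE pads short rows; col.names fixes the number of columns
alt_cols <- read.table(text = as.character(dat$ALT), sep = ",",
                       fill = TRUE, col.names = alt_names,
                       colClasses = "character")

# Treat both empty strings and NA as missing and fill them with "0"
alt_cols[is.na(alt_cols) | alt_cols == ""] <- "0"

dat[alt_names] <- alt_cols
dat
```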

Upvotes: 1

Jilber Urbina

Reputation: 61154

Assuming your data is called dat and every row has the same number of comma-separated values:

> dat2 <- data.frame(dat[, -4], do.call(rbind, strsplit(as.character(dat$ALT), ",")))
> colnames(dat2)[4:5] <- c("ALT", "ALT1")
> dat2
  CHOM POS  REF   ALT ALT1
1    1 121    A    AA   AT
2    2 254 GCGC GCGCG AGCG

Upvotes: 2
