Reputation: 1053
I have the same problem as in the question "strsplit one column with exact information into two column".
That question is already solved, but my data looks like this:
      SNP Geno AlleleA AlleleB AlleleC AlleleD AlleleE
1 marker1   G1      AA      AA      AA      AA      AA
2 marker2   G1      TT      TT      TT      TT      TT
3 marker3   G1      TT      TT      TT      TT      TT
4 marker1   G2      CC      CC      CC      CC      CC
5 marker2   G2      AA      AA      AA      AA      AA
6 marker3   G2      TT      TT      TT      TT      TT
7 marker1   G3      GG      GG      GG      GG      GG
8 marker2   G3      AA      AA      AA      AA      AA
9 marker3   G3      TT      TT      TT      TT      TT
dput output:
structure(list(SNP = structure(c(1L, 2L, 3L, 1L, 2L, 3L, 1L,
2L, 3L), .Label = c("marker1", "marker2", "marker3"), class = "factor"),
Geno = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L), .Label = c("G1",
"G2", "G3"), class = "factor"), AlleleA = structure(c(1L,
4L, 4L, 2L, 1L, 4L, 3L, 1L, 4L), .Label = c("AA", "CC", "GG",
"TT"), class = "factor"), AlleleB = structure(c(1L, 4L, 4L,
2L, 1L, 4L, 3L, 1L, 4L), class = "factor", .Label = c("AA",
"CC", "GG", "TT")), AlleleC = structure(c(1L, 4L, 4L, 2L,
1L, 4L, 3L, 1L, 4L), class = "factor", .Label = c("AA", "CC",
"GG", "TT")), AlleleD = structure(c(1L, 4L, 4L, 2L, 1L, 4L,
3L, 1L, 4L), class = "factor", .Label = c("AA", "CC", "GG",
"TT")), AlleleE = structure(c(1L, 4L, 4L, 2L, 1L, 4L, 3L,
1L, 4L), class = "factor", .Label = c("AA", "CC", "GG", "TT"
))), .Names = c("SNP", "Geno", "AlleleA", "AlleleB", "AlleleC",
"AlleleD", "AlleleE"), row.names = c(NA, -9L), class = "data.frame")
In that question, the asker has only one column to split into two. My problem is that I have 5000 columns (AlleleA, AlleleB, ... etc.), and I want to split each of them into two columns.
I tried a loop like this, but it doesn't work:
for(i in colnames(dat)){
  dat1 <- data.frame(do.call(rbind, strsplit(as.vector(sprintf("dat$%s",i)), split = "")))
}
Any help would be appreciated, thank you.
Upvotes: 2
Views: 1084
Reputation: 887138
Another option is
library(qdap)
res <- colsplit2df(dat, splitcols=2:ncol(dat),sep='')
colnames(res)[-1] <- make.names(rep(colnames(dat)[-1],each=2), unique=TRUE)
res[1:3,1:5]
#      SNP Geno Geno.1 AlleleA AlleleA.1
#1 marker1    G      1       A         A
#2 marker2    G      1       T         T
#3 marker3    G      1       T         T
Or only for the Allele columns:
colsplit2df(dat, splitcols=grep('Allele', names(dat)),sep='')
Edit (Tyler Rinker)
May I suggest editing the column names of the data.frame using setNames first, as follows:
setNames(dat, gsub("([A-Z]{1}[a-z]+[A-Z])", "\\1.1&\\1.2", names(dat))) %>%
colsplit2df(splitcols=3:ncol(dat), sep='')
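For comparison, here is a package-free sketch in base R. It also shows why the loop in the question fails: sprintf("dat$%s", i) only builds the text "dat$AlleleA", so strsplit splits that text rather than the column, and dat1 is overwritten on every iteration. The cut-down dat and the helper name split_one are assumptions for illustration, and the sketch assumes every genotype is exactly two characters long.

```r
# Cut-down stand-in for the data in the question (factor columns, as in the dput)
dat <- data.frame(
  SNP     = c("marker1", "marker2", "marker3"),
  Geno    = c("G1", "G1", "G1"),
  AlleleA = c("AA", "TT", "TT"),
  AlleleB = c("AA", "TT", "TT"),
  stringsAsFactors = TRUE
)

# Only the genotype columns should be split
allele_cols <- grep("^Allele", names(dat), value = TRUE)

# Hypothetical helper: split one two-character column into two columns
split_one <- function(col) {
  x <- as.character(dat[[col]])  # dat[[col]] fetches the column; factors need as.character
  out <- data.frame(substr(x, 1, 1), substr(x, 2, 2), stringsAsFactors = FALSE)
  names(out) <- paste0(col, c("_1", "_2"))
  out
}

# Keep the untouched columns and bind all split pairs next to them
res <- cbind(
  dat[setdiff(names(dat), allele_cols)],
  do.call(cbind, lapply(allele_cols, split_one))
)
res
```

Using dat[[col]] with the column name as a string is the key change; the _1/_2 suffixes are an arbitrary naming choice.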
Upvotes: 3
Reputation: 193527
You can use cSplit from my "splitstackshape" package with the argument stripWhite = FALSE.
For example, if we wanted to split all the "Allele*" columns, we would do:
library(splitstackshape)
cSplit(mydf, grep("Allele", names(mydf)), "", stripWhite = FALSE)
#        SNP Geno AlleleA_1 AlleleA_2 AlleleB_1 AlleleB_2 AlleleC_1
# 1: marker1   G1         A         A         A         A         A
# 2: marker2   G1         T         T         T         T         T
# 3: marker3   G1         T         T         T         T         T
# 4: marker1   G2         C         C         C         C         C
# 5: marker2   G2         A         A         A         A         A
# 6: marker3   G2         T         T         T         T         T
# 7: marker1   G3         G         G         G         G         G
# 8: marker2   G3         A         A         A         A         A
# 9: marker3   G3         T         T         T         T         T
#    AlleleC_2 AlleleD_1 AlleleD_2 AlleleE_1 AlleleE_2
# 1:         A         A         A         A         A
# 2:         T         T         T         T         T
# 3:         T         T         T         T         T
# 4:         C         C         C         C         C
# 5:         A         A         A         A         A
# 6:         T         T         T         T         T
# 7:         G         G         G         G         G
# 8:         A         A         A         A         A
# 9:         T         T         T         T         T
Upvotes: 4
Reputation: 2771
As @beginneR says, you can use tidyr::separate. Here is an example taken from http://blog.rstudio.org/2014/07/22/introducing-tidyr/:
head(tidier, 8)
#>   id       trt     key    time
#> 1  1 treatment work.T1 0.08514
#> 2  2   control work.T1 0.22544
#> 3  3 treatment work.T1 0.27453
#> 4  4   control work.T1 0.27231
#> 5  1 treatment home.T1 0.61583
#> 6  2   control home.T1 0.42967
#> 7  3 treatment home.T1 0.65166
#> 8  4   control home.T1 0.56774
tidy <- tidier %>%
separate(key, into = c("location", "time"), sep = "\\.")
tidy %>% head(8)
#>   id       trt location time    time
#> 1  1 treatment     work   T1 0.08514
#> 2  2   control     work   T1 0.22544
#> 3  3 treatment     work   T1 0.27453
#> 4  4   control     work   T1 0.27231
#> 5  1 treatment     home   T1 0.61583
#> 6  2   control     home   T1 0.42967
#> 7  3 treatment     home   T1 0.65166
#> 8  4   control     home   T1 0.56774
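Applied to the data in the question, the same idea might look like the sketch below. It assumes every genotype has exactly two characters, so separate can split by position (sep = 1) rather than on a delimiter. The small dat is a cut-down stand-in for the real data, and the _1/_2 suffixes are an arbitrary naming choice.

```r
library(tidyr)

# Cut-down stand-in for the data in the question
dat <- data.frame(
  SNP     = c("marker1", "marker2", "marker3"),
  Geno    = c("G1", "G1", "G1"),
  AlleleA = c("AA", "TT", "TT"),
  AlleleB = c("AA", "TT", "TT"),
  stringsAsFactors = FALSE
)

# Names of the columns to split
allele_cols <- grep("^Allele", names(dat), value = TRUE)

res <- dat
for (col in allele_cols) {
  # !!col unquotes the column name held in `col`;
  # sep = 1 splits after the first character
  res <- separate(res, !!col, into = paste0(col, c("_1", "_2")), sep = 1)
}
res
```

With 5000 columns this loop calls separate once per column, so it may be noticeably slower than the vectorised options in the other answers.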
Upvotes: 2