Reputation: 13
I'm pretty new to R and I was wondering if it is possible to separate columns based on their identity? For example, I would like to separate column 5 into new columns with all the ones that start with "E-" in one column and all the "D-" in another column.
chr1 11046 12418 2 E-H3K27ac,D-Both
chr1 18615 19392 2 D-Both,E-Both
chr1 27209 28559 3 E-H3K4me1,D-Both,E-Both
chr1 35274 35492 1 E-H3K4me1
chr1 36589 38097 2 D-Both,E-Both
chr1 43655 45148 2 D-Both,E-Both
chr1 49265 50054 2 D-Both,E-H3K4me1
chr1 50117 50465 1 E-H3K4me1
So that it would look something like:
chr1 11046 12418 2 E-H3K27ac D-Both
chr1 18615 19392 2 E-Both D-Both
chr1 27209 28559 3 E-H3K4me1,E-Both D-Both
chr1 35274 35492 1 E-H3K4me1
chr1 36589 38097 2 E-Both D-Both
chr1 43655 45148 2 E-Both D-Both
chr1 49265 50054 2 E-H3K4me1 D-Both
chr1 50117 50465 1 E-H3K4me1
I'm not sure if rows with 3 entries would need to be dealt with first. (In this example I would probably just convert the 2 E- starts in row 3 to just E-Both). Sorry if a similar question has already been asked. I'm still learning the jargon so my search skills about r problems are limited.
Upvotes: 1
Views: 43
Reputation: 887651
We can do with this with base R
. Split the 'Col' by the delimiter ,
using strsplit
(if it is factor
class, wrap it with as.character(df1$Col)
) into a list
, then paste
the elements after grouping by the first character (substr(x, 1, 1)
), sort
in decreasing order. We pad the elements of list
that have only a single element with NA, rbind
the list
elements and create two new columns 'E_col' and 'D_col'
lst <- lapply(strsplit(df1$Col, ","), function(x)
sort(tapply(x, substr(x, 1, 1), FUN = toString), decreasing=TRUE))
df1[c("E_col", "D_col")] <- do.call(rbind, lapply(lst, `length<-`, max(lengths(lst))))
df1
# chr ID1 ID2 val Col E_col D_col
#1 chr1 11046 12418 2 E-H3K27ac,D-Both E-H3K27ac D-Both
#2 chr1 18615 19392 2 D-Both,E-Both E-Both D-Both
#3 chr1 27209 28559 3 E-H3K4me1,D-Both,E-Both E-H3K4me1, E-Both D-Both
#4 chr1 35274 35492 1 E-H3K4me1 E-H3K4me1 <NA>
#5 chr1 36589 38097 2 D-Both,E-Both E-Both D-Both
#6 chr1 43655 45148 2 D-Both,E-Both E-Both D-Both
#7 chr1 49265 50054 2 D-Both,E-H3K4me1 E-H3K4me1 D-Both
#8 chr1 50117 50465 1 E-H3K4me1 E-H3K4me1 <NA>
Or another option is to use cSplit
from splitstackshape
to split the 'Col' and reshape the dataset to 'long' format and then with dcast
we change it to 'wide'
library(splitstackshape)
dcast(cSplit(df1, "Col", ",", "long")[, toString(Col) ,
.(chr, ID1, ID2, val, grp=factor(substr(Col, 1, 1), levels = c("E", "D")))],
... ~ grp, value.var = "V1")
NOTE: As @Frank suggested in the comments, it may be good to keep it in the 'long' format (output of cSplit(df1, "Col", ",", "long")
) instead of combining it back to 'wide' format
df1 <- structure(list(chr = c("chr1", "chr1", "chr1", "chr1", "chr1",
"chr1", "chr1", "chr1"), ID1 = c(11046L, 18615L, 27209L, 35274L,
36589L, 43655L, 49265L, 50117L), ID2 = c(12418L, 19392L, 28559L,
35492L, 38097L, 45148L, 50054L, 50465L), val = c(2L, 2L, 3L,
1L, 2L, 2L, 2L, 1L), Col = c("E-H3K27ac,D-Both", "D-Both,E-Both",
"E-H3K4me1,D-Both,E-Both", "E-H3K4me1", "D-Both,E-Both", "D-Both,E-Both",
"D-Both,E-H3K4me1", "E-H3K4me1")), .Names = c("chr", "ID1", "ID2",
"val", "Col"), class = "data.frame", row.names = c(NA, -8L))
Upvotes: 1