Can I separate a column based on its identity?

Question

I'm pretty new to R and I was wondering if it is possible to separate columns based on their identity? For example, I would like to separate column 5 into new columns with all the ones that start with "E-" in one column and all the "D-" in another column.

chr1    11046   12418   2   E-H3K27ac,D-Both
chr1    18615   19392   2   D-Both,E-Both
chr1    27209   28559   3   E-H3K4me1,D-Both,E-Both
chr1    35274   35492   1   E-H3K4me1
chr1    36589   38097   2   D-Both,E-Both
chr1    43655   45148   2   D-Both,E-Both
chr1    49265   50054   2   D-Both,E-H3K4me1
chr1    50117   50465   1   E-H3K4me1

So that it would look something like:

chr1    11046   12418   2   E-H3K27ac        D-Both
chr1    18615   19392   2   E-Both           D-Both
chr1    27209   28559   3   E-H3K4me1,E-Both D-Both
chr1    35274   35492   1   E-H3K4me1
chr1    36589   38097   2   E-Both           D-Both
chr1    43655   45148   2   E-Both           D-Both
chr1    49265   50054   2   E-H3K4me1        D-Both
chr1    50117   50465   1   E-H3K4me1

I'm not sure if rows with 3 entries would need to be dealt with first. (In this example I would probably just convert the 2 E- starts in row 3 to just E-Both). Sorry if a similar question has already been asked. I'm still learning the jargon so my search skills about r problems are limited.

akrun · Accepted Answer

We can do with this with base R. Split the 'Col' by the delimiter , using strsplit (if it is factor class, wrap it with as.character(df1$Col)) into a list, then paste the elements after grouping by the first character (substr(x, 1, 1)), sort in decreasing order. We pad the elements of list that have only a single element with NA, rbind the list elements and create two new columns 'E_col' and 'D_col'

lst <- lapply(strsplit(df1$Col, ","), function(x) 
    sort(tapply(x, substr(x, 1, 1), FUN = toString), decreasing=TRUE))
df1[c("E_col", "D_col")] <- do.call(rbind, lapply(lst, `length<-`, max(lengths(lst))))
df1
#   chr   ID1   ID2 val                     Col             E_col  D_col
#1 chr1 11046 12418   2        E-H3K27ac,D-Both         E-H3K27ac D-Both
#2 chr1 18615 19392   2           D-Both,E-Both            E-Both D-Both
#3 chr1 27209 28559   3 E-H3K4me1,D-Both,E-Both E-H3K4me1, E-Both D-Both
#4 chr1 35274 35492   1               E-H3K4me1         E-H3K4me1   
#5 chr1 36589 38097   2           D-Both,E-Both            E-Both D-Both
#6 chr1 43655 45148   2           D-Both,E-Both            E-Both D-Both
#7 chr1 49265 50054   2        D-Both,E-H3K4me1         E-H3K4me1 D-Both
#8 chr1 50117 50465   1               E-H3K4me1         E-H3K4me1

Or another option is to use cSplit from splitstackshape to split the 'Col' and reshape the dataset to 'long' format and then with dcast we change it to 'wide'

library(splitstackshape)
dcast(cSplit(df1, "Col", ",", "long")[, toString(Col) ,
  .(chr, ID1, ID2, val, grp=factor(substr(Col, 1, 1), levels = c("E", "D")))], 
             ... ~ grp, value.var = "V1")

NOTE: As @Frank suggested in the comments, it may be good to keep it in the 'long' format (output of cSplit(df1, "Col", ",", "long")) instead of combining it back to 'wide' format

data

df1 <- structure(list(chr = c("chr1", "chr1", "chr1", "chr1", "chr1", 
"chr1", "chr1", "chr1"), ID1 = c(11046L, 18615L, 27209L, 35274L, 
36589L, 43655L, 49265L, 50117L), ID2 = c(12418L, 19392L, 28559L, 
35492L, 38097L, 45148L, 50054L, 50465L), val = c(2L, 2L, 3L, 
1L, 2L, 2L, 2L, 1L), Col = c("E-H3K27ac,D-Both", "D-Both,E-Both", 
"E-H3K4me1,D-Both,E-Both", "E-H3K4me1", "D-Both,E-Both", "D-Both,E-Both", 
"D-Both,E-H3K4me1", "E-H3K4me1")), .Names = c("chr", "ID1", "ID2", 
"val", "Col"), class = "data.frame", row.names = c(NA, -8L))

Can I separate a column based on its identity?

Answers (1)

data

Related Questions