Split strings in column dataframe in R and create additional columns for the substrings

Question

When working with genomic array data, a 'probe' is often assigned to different genes (different transcripts). Object df shows an example of this.

df <- data.frame(c("geneA;geneB;geneB", "geneG", "geneC;geneD"))
colnames(df) <- "gene.names"
df#looks like this:

         gene.names
1 geneA;geneB;geneB
2             geneG
3       geneC;geneD

I would like to split every element of df$gene.names at ; and put each substring in its own column. NA can be used where a row has no more genes.

This script works, but I think most people will agree that it is clumsy code and not very efficient. Can someone suggest a better alternative?

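library(plyr)  # provides rbind.fill()

out <- NULL
for (i in seq_len(nrow(df))) {
    # split row i at ";" and turn the pieces into a one-row data frame
    one <- as.data.frame(t(as.data.frame(strsplit(as.character(df[i, 1]), ";"))))
    # rbind.fill() pads missing columns with NA
    out <- rbind.fill(out, one)
}
out  # looks like this: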

     V1    V2    V3
1 geneA geneB geneB
2 geneG    
3 geneC geneD

jalapic · Accepted Answer

I recommend using splitstackshape for this:

splitstackshape::cSplit(df, splitCols="gene.names", sep=";")

   gene.names_1 gene.names_2 gene.names_3
1:        geneA        geneB        geneB
2:        geneG           NA           NA
3:        geneC        geneD           NA
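
If you would rather avoid an extra package, here is a minimal base R sketch of the same idea: split each string, pad every piece to the longest length with NA, and bind the pieces row-wise. The names parts, n and out are just illustrative, and the resulting columns get the default names V1, V2, ... rather than gene.names_1 etc.

```r
# Example data from the question
df <- data.frame(gene.names = c("geneA;geneB;geneB", "geneG", "geneC;geneD"),
                 stringsAsFactors = FALSE)

# Split each string at ";" (fixed = TRUE treats ";" literally, not as a regex)
parts <- strsplit(as.character(df$gene.names), ";", fixed = TRUE)

# Length of the longest split, i.e. the number of columns needed
n <- max(lengths(parts))

# `length<-` pads each character vector with NA up to length n;
# rbind then stacks the padded vectors as rows of a character matrix
out <- as.data.frame(do.call(rbind, lapply(parts, `length<-`, n)),
                     stringsAsFactors = FALSE)
out
```

This avoids growing a data frame inside a loop, which is the part of the original script that tends to get slow on large inputs.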
