Pierre-louis Stenger
Pierre-louis Stenger

Reputation: 301

Split a data.frame column into others columns

I have a big data.frame with some columns, but my 9th column is made of data separated by semicolon :

    gtf$V9
1                 gene_id CUFF.1; transcript_id CUFF.1.1; FPKM 7.0762407256; frac 1.000000; conf_lo 4.347062; conf_hi 9.805420; cov 25.616962;
2  gene_id CUFF.1; transcript_id CUFF.1.1; exon_number 1; FPKM 7.0762407256; frac 1.000000; conf_lo 4.347062; conf_hi 9.805420; cov 25.616962;
3  gene_id CUFF.1; transcript_id CUFF.1.1; exon_number 2; FPKM 7.0762407256; frac 1.000000; conf_lo 4.347062; conf_hi 9.805420; cov 25.616962;
4  gene_id CUFF.1; transcript_id CUFF.1.1; exon_number 3; FPKM 7.0762407256; frac 1.000000; conf_lo 4.347062; conf_hi 9.805420; cov 25.616962;

So I would like to cut this column into others columns and merge this later with the other part of the data.frame (the others columns before the 9th column).

I've tried some code without results :

head(gtf$V9, sep = ";",stringsAsFactors = FALSE) 

or

new_df <- matrix(gtf$V9, ncol=7, byrow=TRUE) # sep = ";"

The same thing with as.data.frame, data.frame or as.matrix

I have also tried to write.csv and import this with includ a sep=";", but the data.frame is too bigger and my computer is lagging..

Any advice?

Upvotes: 2

Views: 212

Answers (3)

Jaap
Jaap

Reputation: 83275

Another option is to use the splitstackshape-package (which also loads data.table). Using:

library(splitstackshape)
cSplit(cSplit(df, 'V9', sep = ';', direction = 'long'),
       'V9', sep = ' ')[, dcast(.SD, cumsum(V9_1 == 'gene_id') ~ V9_1)]

gives:

   V9_1  conf_hi  conf_lo       cov exon_number         FPKM     frac gene_id transcript_id
1:    1 9.805420 4.347062 25.616962          NA 7.0762407256 1.000000  CUFF.1      CUFF.1.1
2:    2 9.805420 4.347062 25.616962           1 7.0762407256 1.000000  CUFF.1      CUFF.1.1
3:    3 9.805420 4.347062 25.616962           2 7.0762407256 1.000000  CUFF.1      CUFF.1.1
4:    4 9.805420 4.347062 25.616962           3 7.0762407256 1.000000  CUFF.1      CUFF.1.1

Upvotes: 3

AntoniosK
AntoniosK

Reputation: 16121

# example dataset (only variable of interest included)
df = data.frame(V9=c("gene_id CUFF.1; transcript_id CUFF.1.1; FPKM 7.0762407256; frac 1.000000; conf_lo 4.347062; conf_hi 9.805420; cov 25.616962;",
                "gene_id CUFF.1; transcript_id CUFF.1.1; exon_number 1; FPKM 7.0762407256; frac 1.000000; conf_lo 4.347062; conf_hi 9.805420; cov 25.616962;",
                "gene_id CUFF.1; transcript_id CUFF.1.1; exon_number 2; FPKM 7.0762407256; frac 1.000000; conf_lo 4.347062; conf_hi 9.805420; cov 25.616962;",
                "gene_id CUFF.1; transcript_id CUFF.1.1; exon_number 3; FPKM 7.0762407256; frac 1.000000; conf_lo 4.347062; conf_hi 9.805420; cov 25.616962;"),
                stringsAsFactors = F)

library(dplyr)
library(tidyr)

df %>%
  mutate(id = row_number()) %>%                  # flag row ids (will need those to reshape data later)                
  separate_rows(V9, sep="; ") %>%                # split strings and create new rows
  separate(V9, c("name","value"), sep=" ") %>%   # separate column name from value
  mutate(value = gsub(";","",value)) %>%         # remove ; when necessary
  spread(name, value)                            # reshape data

#   id  conf_hi  conf_lo       cov exon_number         FPKM     frac gene_id transcript_id
# 1  1 9.805420 4.347062 25.616962        <NA> 7.0762407256 1.000000  CUFF.1      CUFF.1.1
# 2  2 9.805420 4.347062 25.616962           1 7.0762407256 1.000000  CUFF.1      CUFF.1.1
# 3  3 9.805420 4.347062 25.616962           2 7.0762407256 1.000000  CUFF.1      CUFF.1.1
# 4  4 9.805420 4.347062 25.616962           3 7.0762407256 1.000000  CUFF.1      CUFF.1.1

You can join this dataset back to your initial dataset using the row ids (id). You need to create an id in your original dataset as well.

Upvotes: 1

p130ter
p130ter

Reputation: 86

you could do strsplit() within a sapply()

If you know how many objects can be in V9 than you can do a for loop over it

for (i in 1:number_of_max_objects_in_V9) {
 gtf[ncol(gtf)+1] = sapply(1:nrow(gtf), function(x) strsplit(gtf$V9[x],',')[[1]][i])
}

if you don't know how many objects can V9 have, then just run a str_count on , in gtf$V9 like this:

library(stringr)
number_of_max_objects_in_V9 <- max(sapply(1:nrow(gtf), function(x) str_count(gtf$V9,',')))

Upvotes: 1

Related Questions