Reputation: 301
I have a big data.frame
with some columns, but my 9th column is made of data separated by semicolon :
gtf$V9
1 gene_id CUFF.1; transcript_id CUFF.1.1; FPKM 7.0762407256; frac 1.000000; conf_lo 4.347062; conf_hi 9.805420; cov 25.616962;
2 gene_id CUFF.1; transcript_id CUFF.1.1; exon_number 1; FPKM 7.0762407256; frac 1.000000; conf_lo 4.347062; conf_hi 9.805420; cov 25.616962;
3 gene_id CUFF.1; transcript_id CUFF.1.1; exon_number 2; FPKM 7.0762407256; frac 1.000000; conf_lo 4.347062; conf_hi 9.805420; cov 25.616962;
4 gene_id CUFF.1; transcript_id CUFF.1.1; exon_number 3; FPKM 7.0762407256; frac 1.000000; conf_lo 4.347062; conf_hi 9.805420; cov 25.616962;
So I would like to cut this column into others columns and merge
this later with the other part of the data.frame
(the others columns before the 9th column).
I've tried some code without results :
head(gtf$V9, sep = ";",stringsAsFactors = FALSE)
or
new_df <- matrix(gtf$V9, ncol=7, byrow=TRUE) # sep = ";"
The same thing with as.data.frame
, data.frame
or as.matrix
I have also tried to write.csv
and import this with includ a sep=";"
, but the data.frame
is too bigger and my computer is lagging..
Any advice?
Upvotes: 2
Views: 212
Reputation: 83275
Another option is to use the splitstackshape
-package (which also loads data.table
). Using:
library(splitstackshape)
cSplit(cSplit(df, 'V9', sep = ';', direction = 'long'),
'V9', sep = ' ')[, dcast(.SD, cumsum(V9_1 == 'gene_id') ~ V9_1)]
gives:
V9_1 conf_hi conf_lo cov exon_number FPKM frac gene_id transcript_id 1: 1 9.805420 4.347062 25.616962 NA 7.0762407256 1.000000 CUFF.1 CUFF.1.1 2: 2 9.805420 4.347062 25.616962 1 7.0762407256 1.000000 CUFF.1 CUFF.1.1 3: 3 9.805420 4.347062 25.616962 2 7.0762407256 1.000000 CUFF.1 CUFF.1.1 4: 4 9.805420 4.347062 25.616962 3 7.0762407256 1.000000 CUFF.1 CUFF.1.1
Upvotes: 3
Reputation: 16121
# example dataset (only variable of interest included)
df = data.frame(V9=c("gene_id CUFF.1; transcript_id CUFF.1.1; FPKM 7.0762407256; frac 1.000000; conf_lo 4.347062; conf_hi 9.805420; cov 25.616962;",
"gene_id CUFF.1; transcript_id CUFF.1.1; exon_number 1; FPKM 7.0762407256; frac 1.000000; conf_lo 4.347062; conf_hi 9.805420; cov 25.616962;",
"gene_id CUFF.1; transcript_id CUFF.1.1; exon_number 2; FPKM 7.0762407256; frac 1.000000; conf_lo 4.347062; conf_hi 9.805420; cov 25.616962;",
"gene_id CUFF.1; transcript_id CUFF.1.1; exon_number 3; FPKM 7.0762407256; frac 1.000000; conf_lo 4.347062; conf_hi 9.805420; cov 25.616962;"),
stringsAsFactors = F)
library(dplyr)
library(tidyr)
df %>%
mutate(id = row_number()) %>% # flag row ids (will need those to reshape data later)
separate_rows(V9, sep="; ") %>% # split strings and create new rows
separate(V9, c("name","value"), sep=" ") %>% # separate column name from value
mutate(value = gsub(";","",value)) %>% # remove ; when necessary
spread(name, value) # reshape data
# id conf_hi conf_lo cov exon_number FPKM frac gene_id transcript_id
# 1 1 9.805420 4.347062 25.616962 <NA> 7.0762407256 1.000000 CUFF.1 CUFF.1.1
# 2 2 9.805420 4.347062 25.616962 1 7.0762407256 1.000000 CUFF.1 CUFF.1.1
# 3 3 9.805420 4.347062 25.616962 2 7.0762407256 1.000000 CUFF.1 CUFF.1.1
# 4 4 9.805420 4.347062 25.616962 3 7.0762407256 1.000000 CUFF.1 CUFF.1.1
You can join this dataset back to your initial dataset using the row ids (id
). You need to create an id
in your original dataset as well.
Upvotes: 1
Reputation: 86
you could do strsplit()
within a sapply()
If you know how many objects can be in V9 than you can do a for loop over it
for (i in 1:number_of_max_objects_in_V9) {
gtf[ncol(gtf)+1] = sapply(1:nrow(gtf), function(x) strsplit(gtf$V9[x],',')[[1]][i])
}
if you don't know how many objects can V9 have, then just run a str_count
on ,
in gtf$V9 like this:
library(stringr)
number_of_max_objects_in_V9 <- max(sapply(1:nrow(gtf), function(x) str_count(gtf$V9,',')))
Upvotes: 1