Reputation: 401
I wrote a script using the data.table
package to parse out the last column of a GENCODE gtf file. The column, for those unaware, contains a handful of key-value items separated by a semi-colon for each row. The particular file I'm working with contains ~2.5 million rows. I indexed out the first 100, then the first 1000 rows just to test the script and the output is exactly what I need. However, despite using the set
function, the run-time isn't as fast as I expected. It's instantaneous with the first 100 rows, but takes about a minute or two for the first 1000. Here is the script.
#LOAD DATA.TABLE LIBRARY
require(data.table)
#READ GTF ANNOTATION FILE
info <- fread("gencodeAnnotation.gtf")
colnames(info)[9] <- "AdditionalInfo"
info <- info[1:1000]
#CREATE LIST OF 'KEYS' TO PARSE OUT
pars <- as.character(list("gene_id", "gene_type", "gene_status", "gene_name", " level ", "transcript_name", "transcript_id", "transcript_type", "transcript_support_level", "havana_gene"))
#NESTED FOR LOOP TO PARSE KEY-VALUE PAIR
for (i in 1:length(pars)) {
for (j in 1:nrow(info)) {
infoRow <- info[,tstrsplit(AdditionalInfo, ';', fixed = T)][j]
headerCheck <- like(infoRow, pars[i])
if (any(headerCheck) == TRUE) {
keyVal <- length(tstrsplit(infoRow[[which(headerCheck == T)]], " ", fixed = T))
set(info, i = j, j = toupper(pars[i]), value = tstrsplit(infoRow[[which(headerCheck == T)]], " ", fixed = T)[[keyVal]])
} else {
set(info, i = j, j = toupper(pars[i]), value = NA)
}
}
}
As I said before, the output is perfect when tested on the first 100, 1000 rows. Based on the code, it has to loop through all the rows multiplied by the number of columns to add, or the items in pars
. My question is, what's missing in my script or what edits can I make to reduce run-time? Here is the link for the gtf file being used: http://www.gencodegenes.org/releases/current.html. It is the first link labeled "Comprehensive gene annotation". Thanks in advance.
SAMPLE OF WHAT EACH ROW LOOKS LIKE:
gene_id ENSG00000223972.5; gene_type transcribed_unprocessed_pseudogene; gene_status KNOWN; gene_name DDX11L1; level 2; havana_gene OTTHUMG00000000961.2; remap_status full_contig; remap_num_mappings 1; remap_target_status overlap;
Upvotes: 0
Views: 2370
Reputation: 31
If you know Julia language, you can try OpenGene(https://github.com/OpenGene/OpenGene.jl), which has built-in gencode parsing.
using OpenGene, OpenGene.Reference
# load the gencode dataset, it will download a file from gencode website if it not downloaded before
# once it's loaded, it will be cached so future loads will be fast
index = gencode_load("GRCh37")
# locate which gene chr:pos is in
gencode_locate(index, "chr5", 149526621)
# it will return
# 1-element Array{Any,1}:
# Dict{ASCIIString,Any}("gene"=>"PDGFRB","number"=>1,"transcript"=>"ENST00000261799.4","type"=>"intron")
genes = gencode_genes(index, "TP53")
# return an array with only one record
genes[1].name, genes[1].chr, genes[1].start_pos, genes[1].end_pos
# ("TP53","chr17",7565097,7590856)
Upvotes: 1
Reputation: 118789
I find the readGFF
function from the bioconductor package rtracklayer quite appropriate here.
require(rtracklayer)
system.time(gtf <- readGFF("~/Downloads/gencode.v24.annotation.gtf", version=2L))
# user system elapsed
# 34.558 1.541 36.737
dim(gtf)
# [1] 2572840 26
You can also select just the tags you like.
system.time(gtf_tags <- readGFF("~/Downloads/gencode.v24.annotation.gtf", version=2L,
tags = c("gene_id", "transcript_id")))
# user system elapsed
# 16.830 0.733 17.780
dim(gtf_tags)
# [1] 2572840 10
Upvotes: 7
Reputation: 49448
Based on MRE from fanli's answer:
rbindlist(info[, .(list(eval(parse(text=
paste0('list(',
sub(',$', '',
gsub('([^ ]+) ([^;]+); *', '\\1=\\2,', AdditionalInfo)),
')')), .SD)))
, by = AdditionalInfo]$V1, fill = T)
# gene_id gene_type gene_status gene_name level havana_gene transcript_id transcript_type
#1: ENSG00000223972.5 transcribed_unprocessed_pseudogene KNOWN DDX11L1 2 OTTHUMG00000000961.2 NA NA
#2: ENSG00000223972.5 transcribed_unprocessed_pseudogene KNOWN DDX11L1 2 OTTHUMG00000000961.2 ENST00000456328.2 processed_transcript
#3: ENSG00000223972.5 transcribed_unprocessed_pseudogene KNOWN DDX11L1 2 OTTHUMG00000000961.2 ENST00000456328.2 processed_transcript
#4: ENSG00000223972.5 transcribed_unprocessed_pseudogene KNOWN DDX11L1 2 OTTHUMG00000000961.2 ENST00000456328.2 processed_transcript
#5: ENSG00000223972.5 transcribed_unprocessed_pseudogene KNOWN DDX11L1 2 OTTHUMG00000000961.2 ENST00000456328.2 processed_transcript
# transcript_status transcript_name tag transcript_support_level havana_transcript exon_number exon_id
#1: NA NA NA NA NA NA NA
#2: KNOWN DDX11L1-002 basic 1 OTTHUMT00000362751.1 NA NA
#3: KNOWN DDX11L1-002 basic 1 OTTHUMT00000362751.1 1 ENSE00002234944.1
#4: KNOWN DDX11L1-002 basic 1 OTTHUMT00000362751.1 2 ENSE00003582793.1
#5: KNOWN DDX11L1-002 basic 1 OTTHUMT00000362751.1 3 ENSE00002312635.1
The above basically replaces spaces with =
, and ;
with commas, slaps a list
in front of that string and parses it as an R expression, converting it to an actual list. The rest is just massaging it into the correct shape.
Upvotes: 0
Reputation: 1099
MRE:
> dput(info[1:5,])
structure(list(V1 = c("chr1", "chr1", "chr1", "chr1", "chr1"),
V2 = c("HAVANA", "HAVANA", "HAVANA", "HAVANA", "HAVANA"),
V3 = c("gene", "transcript", "exon", "exon", "exon"), V4 = c(11869L,
11869L, 11869L, 12613L, 13221L), V5 = c(14409L, 14409L, 12227L,
12721L, 14409L), V6 = c(".", ".", ".", ".", "."), V7 = c("+",
"+", "+", "+", "+"), V8 = c(".", ".", ".", ".", "."), AdditionalInfo = c("gene_id \"ENSG00000223972.5\"; gene_type \"transcribed_unprocessed_pseudogene\"; gene_status \"KNOWN\"; gene_name \"DDX11L1\"; level 2; havana_gene \"OTTHUMG00000000961.2\";",
"gene_id \"ENSG00000223972.5\"; transcript_id \"ENST00000456328.2\"; gene_type \"transcribed_unprocessed_pseudogene\"; gene_status \"KNOWN\"; gene_name \"DDX11L1\"; transcript_type \"processed_transcript\"; transcript_status \"KNOWN\"; transcript_name \"DDX11L1-002\"; level 2; tag \"basic\"; transcript_support_level \"1\"; havana_gene \"OTTHUMG00000000961.2\"; havana_transcript \"OTTHUMT00000362751.1\";",
"gene_id \"ENSG00000223972.5\"; transcript_id \"ENST00000456328.2\"; gene_type \"transcribed_unprocessed_pseudogene\"; gene_status \"KNOWN\"; gene_name \"DDX11L1\"; transcript_type \"processed_transcript\"; transcript_status \"KNOWN\"; transcript_name \"DDX11L1-002\"; exon_number 1; exon_id \"ENSE00002234944.1\"; level 2; tag \"basic\"; transcript_support_level \"1\"; havana_gene \"OTTHUMG00000000961.2\"; havana_transcript \"OTTHUMT00000362751.1\";",
"gene_id \"ENSG00000223972.5\"; transcript_id \"ENST00000456328.2\"; gene_type \"transcribed_unprocessed_pseudogene\"; gene_status \"KNOWN\"; gene_name \"DDX11L1\"; transcript_type \"processed_transcript\"; transcript_status \"KNOWN\"; transcript_name \"DDX11L1-002\"; exon_number 2; exon_id \"ENSE00003582793.1\"; level 2; tag \"basic\"; transcript_support_level \"1\"; havana_gene \"OTTHUMG00000000961.2\"; havana_transcript \"OTTHUMT00000362751.1\";",
"gene_id \"ENSG00000223972.5\"; transcript_id \"ENST00000456328.2\"; gene_type \"transcribed_unprocessed_pseudogene\"; gene_status \"KNOWN\"; gene_name \"DDX11L1\"; transcript_type \"processed_transcript\"; transcript_status \"KNOWN\"; transcript_name \"DDX11L1-002\"; exon_number 3; exon_id \"ENSE00002312635.1\"; level 2; tag \"basic\"; transcript_support_level \"1\"; havana_gene \"OTTHUMG00000000961.2\"; havana_transcript \"OTTHUMT00000362751.1\";"
)), .Names = c("V1", "V2", "V3", "V4", "V5", "V6", "V7",
"V8", "AdditionalInfo"), class = c("data.table", "data.frame"
), row.names = c(NA, -5L), .internal.selfref = <pointer: 0x16489e8>)
Use vectorized operations like lapply
instead of for
loops.
keys <- lapply(info$AdditionalInfo, function(x)
unlist(lapply(unlist(strsplit(x, "; ")),
function(y) unlist(strsplit(y, " "))[1])) )
values <- lapply(info$AdditionalInfo, function(x)
unlist(lapply(unlist(strsplit(x, "; ")),
function(y) gsub("\"", "", unlist(strsplit(y, " "))[2]))) )
You can figure out what to do with resulting keys
and values
.
> keys[[1]]
[1] "gene_id" "gene_type" "gene_status" "gene_name" "level"
[6] "havana_gene"
> values[[1]]
[1] "ENSG00000223972.5" "transcribed_unprocessed_pseudogene"
[3] "KNOWN" "DDX11L1"
[5] "2" "OTTHUMG00000000961.2;"
It's okay, all bioinformaticians have to start somewhere. :)
Upvotes: 1