abbas786
abbas786

Reputation: 401

Parse GTF File from Gencode

I wrote a script using the data.table package to parse out the last column of a GENCODE gtf file. The column, for those unaware, contains a handful of key-value items separated by a semi-colon for each row. The particular file I'm working with contains ~2.5 million rows. I indexed out the first 100, then the first 1000 rows just to test the script and the output is exactly what I need. However, despite using the set function, the run-time isn't as fast as I expected. It's instantaneous with the first 100 rows, but takes about a minute or two for the first 1000. Here is the script.

#LOAD DATA.TABLE LIBRARY
require(data.table)
#READ GTF ANNOTATION FILE
info <- fread("gencodeAnnotation.gtf")
colnames(info)[9] <- "AdditionalInfo"
info <- info[1:1000]
#CREATE LIST OF 'KEYS' TO PARSE OUT
pars <- as.character(list("gene_id", "gene_type", "gene_status", "gene_name", " level ", "transcript_name", "transcript_id", "transcript_type", "transcript_support_level", "havana_gene"))
#NESTED FOR LOOP TO PARSE KEY-VALUE PAIR
for (i in 1:length(pars)) {
    for (j in 1:nrow(info)) {
        infoRow <- info[,tstrsplit(AdditionalInfo, ';', fixed = T)][j]
        headerCheck <- like(infoRow, pars[i])
        if (any(headerCheck) == TRUE) {
          keyVal <- length(tstrsplit(infoRow[[which(headerCheck == T)]], " ", fixed = T))
          set(info, i = j, j = toupper(pars[i]), value = tstrsplit(infoRow[[which(headerCheck == T)]], " ", fixed = T)[[keyVal]])   
        } else {
          set(info, i = j, j = toupper(pars[i]), value = NA)    
        }
    }
}

As I said before, the output is perfect when tested on the first 100, 1000 rows. Based on the code, it has to loop through all the rows multiplied by the number of columns to add, or the items in pars. My question is, what's missing in my script or what edits can I make to reduce run-time? Here is the link for the gtf file being used: http://www.gencodegenes.org/releases/current.html. It is the first link labeled "Comprehensive gene annotation". Thanks in advance.

SAMPLE OF WHAT EACH ROW LOOKS LIKE:

gene_id ENSG00000223972.5; gene_type transcribed_unprocessed_pseudogene; gene_status KNOWN; gene_name DDX11L1; level 2; havana_gene OTTHUMG00000000961.2; remap_status full_contig; remap_num_mappings 1; remap_target_status overlap;

Upvotes: 0

Views: 2370

Answers (4)

Shifu
Shifu

Reputation: 31

If you know Julia language, you can try OpenGene(https://github.com/OpenGene/OpenGene.jl), which has built-in gencode parsing.

using OpenGene, OpenGene.Reference

# load the gencode dataset, it will download a file from gencode website if it not downloaded before
# once it's loaded, it will be cached so future loads will be fast
index = gencode_load("GRCh37")

# locate which gene chr:pos is in
gencode_locate(index, "chr5", 149526621)
# it will return
# 1-element Array{Any,1}:
#  Dict{ASCIIString,Any}("gene"=>"PDGFRB","number"=>1,"transcript"=>"ENST00000261799.4","type"=>"intron")
genes = gencode_genes(index, "TP53")
# return an array with only one record
genes[1].name, genes[1].chr, genes[1].start_pos, genes[1].end_pos
# ("TP53","chr17",7565097,7590856)

Upvotes: 1

Arun
Arun

Reputation: 118789

I find the readGFF function from the bioconductor package rtracklayer quite appropriate here.

require(rtracklayer)
system.time(gtf <- readGFF("~/Downloads/gencode.v24.annotation.gtf", version=2L))
#    user  system elapsed 
#  34.558   1.541  36.737 
dim(gtf)
# [1] 2572840      26

You can also select just the tags you like.

system.time(gtf_tags <- readGFF("~/Downloads/gencode.v24.annotation.gtf", version=2L, 
      tags = c("gene_id", "transcript_id")))
#    user  system elapsed 
#  16.830   0.733  17.780 
dim(gtf_tags)
# [1] 2572840      10

Upvotes: 7

eddi
eddi

Reputation: 49448

Based on MRE from fanli's answer:

rbindlist(info[, .(list(eval(parse(text=
                          paste0('list(',
                                 sub(',$', '',
                                     gsub('([^ ]+) ([^;]+); *', '\\1=\\2,', AdditionalInfo)),
                                 ')')), .SD)))
               , by = AdditionalInfo]$V1, fill = T)
#             gene_id                          gene_type gene_status gene_name level          havana_gene     transcript_id      transcript_type
#1: ENSG00000223972.5 transcribed_unprocessed_pseudogene       KNOWN   DDX11L1     2 OTTHUMG00000000961.2                NA                   NA
#2: ENSG00000223972.5 transcribed_unprocessed_pseudogene       KNOWN   DDX11L1     2 OTTHUMG00000000961.2 ENST00000456328.2 processed_transcript
#3: ENSG00000223972.5 transcribed_unprocessed_pseudogene       KNOWN   DDX11L1     2 OTTHUMG00000000961.2 ENST00000456328.2 processed_transcript
#4: ENSG00000223972.5 transcribed_unprocessed_pseudogene       KNOWN   DDX11L1     2 OTTHUMG00000000961.2 ENST00000456328.2 processed_transcript
#5: ENSG00000223972.5 transcribed_unprocessed_pseudogene       KNOWN   DDX11L1     2 OTTHUMG00000000961.2 ENST00000456328.2 processed_transcript
#   transcript_status transcript_name   tag transcript_support_level    havana_transcript exon_number           exon_id
#1:                NA              NA    NA                       NA                   NA          NA                NA
#2:             KNOWN     DDX11L1-002 basic                        1 OTTHUMT00000362751.1          NA                NA
#3:             KNOWN     DDX11L1-002 basic                        1 OTTHUMT00000362751.1           1 ENSE00002234944.1
#4:             KNOWN     DDX11L1-002 basic                        1 OTTHUMT00000362751.1           2 ENSE00003582793.1
#5:             KNOWN     DDX11L1-002 basic                        1 OTTHUMT00000362751.1           3 ENSE00002312635.1

The above basically replaces spaces with =, and ; with commas, slaps a list in front of that string and parses it as an R expression, converting it to an actual list. The rest is just massaging it into the correct shape.

Upvotes: 0

fanli
fanli

Reputation: 1099

MRE:

> dput(info[1:5,])
structure(list(V1 = c("chr1", "chr1", "chr1", "chr1", "chr1"), 
    V2 = c("HAVANA", "HAVANA", "HAVANA", "HAVANA", "HAVANA"), 
    V3 = c("gene", "transcript", "exon", "exon", "exon"), V4 = c(11869L, 
    11869L, 11869L, 12613L, 13221L), V5 = c(14409L, 14409L, 12227L, 
    12721L, 14409L), V6 = c(".", ".", ".", ".", "."), V7 = c("+", 
    "+", "+", "+", "+"), V8 = c(".", ".", ".", ".", "."), AdditionalInfo = c("gene_id \"ENSG00000223972.5\"; gene_type \"transcribed_unprocessed_pseudogene\"; gene_status \"KNOWN\"; gene_name \"DDX11L1\"; level 2; havana_gene \"OTTHUMG00000000961.2\";", 
    "gene_id \"ENSG00000223972.5\"; transcript_id \"ENST00000456328.2\"; gene_type \"transcribed_unprocessed_pseudogene\"; gene_status \"KNOWN\"; gene_name \"DDX11L1\"; transcript_type \"processed_transcript\"; transcript_status \"KNOWN\"; transcript_name \"DDX11L1-002\"; level 2; tag \"basic\"; transcript_support_level \"1\"; havana_gene \"OTTHUMG00000000961.2\"; havana_transcript \"OTTHUMT00000362751.1\";", 
    "gene_id \"ENSG00000223972.5\"; transcript_id \"ENST00000456328.2\"; gene_type \"transcribed_unprocessed_pseudogene\"; gene_status \"KNOWN\"; gene_name \"DDX11L1\"; transcript_type \"processed_transcript\"; transcript_status \"KNOWN\"; transcript_name \"DDX11L1-002\"; exon_number 1; exon_id \"ENSE00002234944.1\"; level 2; tag \"basic\"; transcript_support_level \"1\"; havana_gene \"OTTHUMG00000000961.2\"; havana_transcript \"OTTHUMT00000362751.1\";", 
    "gene_id \"ENSG00000223972.5\"; transcript_id \"ENST00000456328.2\"; gene_type \"transcribed_unprocessed_pseudogene\"; gene_status \"KNOWN\"; gene_name \"DDX11L1\"; transcript_type \"processed_transcript\"; transcript_status \"KNOWN\"; transcript_name \"DDX11L1-002\"; exon_number 2; exon_id \"ENSE00003582793.1\"; level 2; tag \"basic\"; transcript_support_level \"1\"; havana_gene \"OTTHUMG00000000961.2\"; havana_transcript \"OTTHUMT00000362751.1\";", 
    "gene_id \"ENSG00000223972.5\"; transcript_id \"ENST00000456328.2\"; gene_type \"transcribed_unprocessed_pseudogene\"; gene_status \"KNOWN\"; gene_name \"DDX11L1\"; transcript_type \"processed_transcript\"; transcript_status \"KNOWN\"; transcript_name \"DDX11L1-002\"; exon_number 3; exon_id \"ENSE00002312635.1\"; level 2; tag \"basic\"; transcript_support_level \"1\"; havana_gene \"OTTHUMG00000000961.2\"; havana_transcript \"OTTHUMT00000362751.1\";"
    )), .Names = c("V1", "V2", "V3", "V4", "V5", "V6", "V7", 
"V8", "AdditionalInfo"), class = c("data.table", "data.frame"
), row.names = c(NA, -5L), .internal.selfref = <pointer: 0x16489e8>)

Use vectorized operations like lapply instead of for loops.

keys <- lapply(info$AdditionalInfo, function(x) 
          unlist(lapply(unlist(strsplit(x, "; ")), 
            function(y) unlist(strsplit(y, " "))[1])) )
values <- lapply(info$AdditionalInfo, function(x) 
          unlist(lapply(unlist(strsplit(x, "; ")), 
            function(y) gsub("\"", "", unlist(strsplit(y, " "))[2]))) )

You can figure out what to do with resulting keys and values.

> keys[[1]]
[1] "gene_id"     "gene_type"   "gene_status" "gene_name"   "level"      
[6] "havana_gene"
> values[[1]]
[1] "ENSG00000223972.5"                  "transcribed_unprocessed_pseudogene"
[3] "KNOWN"                              "DDX11L1"                           
[5] "2"                                  "OTTHUMG00000000961.2;"  

It's okay, all bioinformaticians have to start somewhere. :)

Upvotes: 1

Related Questions