Reputation: 3075
I am using biomaRt in R to query Ensembl's hsapiens database of human genes. I am using the function getBM to get every gene's name, start position and stop position, but I cannot find the right attribute for retrieving the TSS (transcription start site). Is that perhaps because it is considered the same as seqType = c("3utr", "5utr")?
Upvotes: 6
Views: 5159
Reputation: 1152
There is now a specific attribute for the transcription start site that can be downloaded: transcription_start_site.
library("biomaRt")
ensembl = useMart("ensembl", dataset = "hsapiens_gene_ensembl")
attributes = listAttributes(ensembl, page = "structure")
attributes[grep("transcript", attributes$description, ignore.case = TRUE), ]
# name description
# 178 ensembl_transcript_id Ensembl Transcript ID
# 183 transcript_start Transcript Start (bp)
# 184 transcript_end Transcript End (bp)
# 185 transcription_start_site Transcription Start Site (TSS)
# 186 transcript_length Transcript length (including UTRs and CDS)
# 195 transcript_count Transcript count
# 201 rank Exon Rank in Transcript
As an example, here is the result for the gene BTC. Note that because it is on the reverse strand (strand == -1), the value for transcription_start_site is the same as the value for transcript_end. Basically, downloading transcription_start_site is a shortcut so that you don't have to determine which end of the transcript is the TSS based on which strand the gene is on.
tss <- getBM(attributes = c("transcription_start_site", "chromosome_name",
                            "transcript_start", "transcript_end",
                            "strand", "ensembl_gene_id",
                            "ensembl_transcript_id", "external_gene_name"),
             filters = "external_gene_name", values = "BTC",
             mart = ensembl)
tss
# transcription_start_site chromosome_name transcript_start transcript_end strand
# 1 75635873 HG706_PATCH 75612096 75635873 -1
# 2 75660403 HG706_PATCH 75610476 75660403 -1
# 3 75719896 4 75669969 75719896 -1
# 4 75695366 4 75671589 75695366 -1
# ensembl_gene_id ensembl_transcript_id external_gene_name
# 1 ENSG00000261530 ENST00000567516 BTC
# 2 ENSG00000261530 ENST00000566356 BTC
# 3 ENSG00000174808 ENST00000395743 BTC
# 4 ENSG00000174808 ENST00000512743 BTC
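The strand rule described above can also be applied by hand, which is a handy sanity check. The following is a minimal sketch in base R that rebuilds the four BTC rows from the output above as a plain data frame (no query is sent); derive_tss is a hypothetical helper name, not part of biomaRt.

```r
# Four BTC transcripts copied from the getBM output above.
btc <- data.frame(
  transcript_start         = c(75612096, 75610476, 75669969, 75671589),
  transcript_end           = c(75635873, 75660403, 75719896, 75695366),
  strand                   = c(-1, -1, -1, -1),
  transcription_start_site = c(75635873, 75660403, 75719896, 75695366)
)

# Hypothetical helper: pick the 5' end of each transcript from its strand.
# Forward strand (1) -> transcript_start; reverse strand (-1) -> transcript_end.
derive_tss <- function(start, end, strand) {
  ifelse(strand == 1, start, end)
}

btc$tss_manual <- derive_tss(btc$transcript_start, btc$transcript_end, btc$strand)

# Compare the hand-derived value with the attribute returned by Ensembl.
all(btc$tss_manual == btc$transcription_start_site)
```

If the comparison holds for your own query results as well, the attribute and the manual rule agree and either can be used.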
Upvotes: 6
Reputation: 26
I believe the "transcript_start" and "transcript_end" are the translation start and stop site, but not necessarily the TSS (transcription start site).
Looking at the "start_position" and "end_position" attributes, these seem to be the TSS (start_position for the + strand and end_position for the - strand), because they always equal the smallest transcript_start among a gene's transcripts on the + strand and the largest transcript_end among them on the - strand.
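One way to examine this kind of observation is to collapse a transcript-level table to the most upstream TSS per gene and compare it with the gene-level coordinates. The sketch below uses only the four BTC rows quoted in the earlier answer, typed in as a data frame; most_upstream_tss is a hypothetical helper, and it assumes all transcripts of a gene share one strand.

```r
# Transcript-level rows for BTC, copied from the output quoted earlier.
tx <- data.frame(
  ensembl_gene_id = c("ENSG00000261530", "ENSG00000261530",
                      "ENSG00000174808", "ENSG00000174808"),
  strand = c(-1, -1, -1, -1),
  transcription_start_site = c(75635873, 75660403, 75719896, 75695366),
  stringsAsFactors = FALSE
)

# Hypothetical helper: the most upstream TSS is the smallest coordinate on the
# forward strand and the largest coordinate on the reverse strand.
most_upstream_tss <- function(tss, strand) {
  if (strand[1] == 1) min(tss) else max(tss)
}

# One value per gene, named by Ensembl gene ID.
gene_tss <- sapply(split(tx, tx$ensembl_gene_id), function(g) {
  most_upstream_tss(g$transcription_start_site, g$strand)
})
gene_tss
```

Adding start_position and end_position to a real getBM query would let you compare them against gene_tss directly.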
Upvotes: 0
Reputation: 14842
A complete list of queryable attributes can be retrieved as a data frame using listAttributes. Then it's just a matter of searching it for the attributes you want.
mart <- useDataset("hsapiens_gene_ensembl", useMart("ensembl"))
att <- listAttributes(mart)
grep("transcript", att$name, value=TRUE)
will get you a rather long list, beginning like this:
[1] "ensembl_transcript_id"
[2] "transcript_start"
[3] "transcript_end"
[4] "external_transcript_id"
[5] "transcript_db_name"
[6] "transcript_count"
[7] "transcript_biotype"
[8] "transcript_status"
[9] "clone_based_ensembl_transcript_name"
[10] "clone_based_vega_transcript_name"
Then you can go ahead and query using these names
getBM(attributes=c("transcript_start", "transcript_end"),
      filters="hgnc_symbol", values="foxp2", mart=mart)
and you get
transcript_start transcript_end
1 113726382 114330960
2 113726494 114271639
3 113726615 114330155
4 113728221 114066565
5 113728221 114271650
6 114054329 114330218
7 114055052 114139783
8 114055052 114333827
9 114055110 114330155
10 114055113 114330200
11 114055275 114269037
12 114055374 114285885
13 114055378 114330012
14 114066555 114294198
15 114066557 114271754
16 114066557 114282629
17 114066570 114294198
18 114055052 114333823
19 114268613 114329981
20 113726615 114310038
If you want all the transcripts of all genes, remove the filters and values arguments, but be aware that you will get a lot of data coming your way.
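If the unfiltered download is too large to handle in one go, one possible approach is to fetch one chromosome at a time and bind the pieces together. This is only a sketch: get_tss_by_chromosome and its fetch argument are hypothetical names introduced here, the chromosome list is an assumption, and fetch defaults to biomaRt::getBM purely so the loop can be tried offline with a stand-in function.

```r
# Sketch: query one chromosome per request instead of the whole genome at once.
# `fetch` defaults to biomaRt::getBM; it is a parameter only so that the loop
# can be exercised without a network connection (illustrative design choice).
get_tss_by_chromosome <- function(mart,
                                  chromosomes = c(1:22, "X", "Y"),
                                  fetch = biomaRt::getBM) {
  attrs <- c("ensembl_gene_id", "ensembl_transcript_id",
             "chromosome_name", "strand", "transcription_start_site")
  chunks <- lapply(chromosomes, function(chr) {
    # Restrict each request to a single chromosome.
    fetch(attributes = attrs, filters = "chromosome_name",
          values = chr, mart = mart)
  })
  # Stack the per-chromosome data frames into one.
  do.call(rbind, chunks)
}

# Usage against the mart object created above (requires a network connection):
# tss_all <- get_tss_by_chromosome(mart)
```

Smaller requests are also easier to retry individually if one of them times out.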
Upvotes: 12