Reputation: 7681
I have a list of gene symbols that represents the intersection of two high-throughput data sets. I'd like to do some GO annotation and clustering, but to do that I need to convert these gene symbols into UniProt accession numbers. What is the best way to do this in Python?
For example, the gene for 'Transforming growth factor beta-1' is called 'TGFB1' and its accession number is 'P01137'. I'm looking for a function, class, module, or package that takes TGFB1 as an argument and returns P01137. Could somebody point me in the right direction? Thanks
Upvotes: 0
Views: 555
Reputation: 7443
Get a mapping from gene name to PDB ID, such as this JSON: http://www.rcsb.org/pdb/browse/homo_sapiens_download.jsp?rows=100000&page=1&sidx=id&sord=desc and save it as, for example, "mapping.json".
Then use that data to get the mapping:
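import json

# Load the downloaded mapping file once.
with open("mapping.json") as mapping:
    map_dict = json.load(mapping)

rows = map_dict["rows"]

def get_uniprot(gene_id):
    # This code assumes each row's "cell" list holds the gene symbol
    # at index 1 and the UniProt accession at index 4.
    for row in rows:
        if row["cell"][1] == gene_id:
            return row["cell"][4]
    return None  # gene symbol not found in the mapping

print(get_uniprot("TGFB1"))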
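Since the question starts from a whole list of gene symbols, scanning every row for each lookup can get slow. A minimal sketch, assuming the same row layout as the code above (gene symbol at index 1, accession at index 4), is to build a dictionary once and look symbols up in it. The name build_index and the sample rows are hypothetical and only for illustration; the "-" entries are placeholders, not real data.

```python
def build_index(rows, symbol_col=1, accession_col=4):
    """Map each gene symbol to its accession; the first occurrence wins."""
    index = {}
    for row in rows:
        cell = row["cell"]
        index.setdefault(cell[symbol_col], cell[accession_col])
    return index


# Hypothetical sample shaped like the "rows" list assumed above.
# Only the TGFB1 -> P01137 pair comes from the question; "-" is a placeholder.
sample = {
    "rows": [
        {"cell": ["-", "TGFB1", "-", "-", "P01137"]},
        {"cell": ["-", "TGFB1", "-", "-", "P01137"]},  # duplicate symbol
    ]
}

index = build_index(sample["rows"])

# Convert a list of symbols; symbols missing from the mapping become None.
genes = ["TGFB1", "NOT_A_GENE"]
accessions = {gene: index.get(gene) for gene in genes}
print(accessions)
```

With the real file you would pass map_dict["rows"] instead of the sample. Symbols that map to None are worth checking by hand, since they may be aliases or simply absent from the downloaded table.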
Upvotes: 1