CiaranWelsh

Reputation: 7681

Convert a list of Gene Symbols to UniProt accession numbers using Python

I have a list of gene symbols that represents the intersection of two high-throughput data sets. I'd like to do GO annotation and clustering on them, but to do that I first need to convert the gene symbols into UniProt accession numbers. What is the best way to do this using Python?

For example, the gene for 'Transforming growth factor beta-1' is called 'TGFB1' and its accession number is 'P01137'. I'm looking for a function/class/module/package that takes TGFB1 as an argument and returns P01137. Could somebody point me in the right direction? Thanks.

Upvotes: 0

Views: 555

Answers (1)

xbello

Reputation: 7443

Download a mapping that lists gene names alongside their IDs, like this JSON from the PDB: http://www.rcsb.org/pdb/browse/homo_sapiens_download.jsp?rows=100000&page=1&sidx=id&sord=desc and save it, for example, as "mapping.json".

Then use that data to get the mapping:

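import json


# Load the downloaded mapping file.
with open("mapping.json") as mapping:
    map_dict = json.load(mapping)

data = map_dict["rows"]

def get_uniprot(gene_id):
    """Return the UniProt accession for gene_id, or None if it is not found."""
    for row in data:
        # In this file, cell[1] holds the gene name and cell[4] the accession.
        if row["cell"][1] == gene_id:
            return row["cell"][4]
    return None

print(get_uniprot("TGFB1"))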
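Since the question is about a whole list of gene symbols, scanning every row for each symbol can get slow. A possible variation, sketched below, builds a dictionary once and then looks up each symbol in it. The helper name `build_index`, the sample rows, and the placeholder gene "GENE2" with accession "Q00000" are made up for illustration; only the "rows"/"cell" layout and the column positions 1 and 4 are taken from the code above.

```python
def build_index(rows, gene_col=1, uniprot_col=4):
    """Map each gene symbol to the list of distinct accessions seen for it."""
    index = {}
    for row in rows:
        cell = row["cell"]
        gene, accession = cell[gene_col], cell[uniprot_col]
        if not gene or not accession:
            continue  # skip rows with a missing gene name or accession
        accessions = index.setdefault(gene, [])
        if accession not in accessions:
            accessions.append(accession)
    return index


# Hypothetical rows shaped like map_dict["rows"]; real data comes from mapping.json.
sample_rows = [
    {"cell": ["x1", "TGFB1", "", "", "P01137"]},
    {"cell": ["x2", "TGFB1", "", "", "P01137"]},  # same gene listed twice
    {"cell": ["x3", "GENE2", "", "", "Q00000"]},  # placeholder gene and accession
]

index = build_index(sample_rows)

# Convert a whole list in one pass; symbols without a match get an empty list.
genes = ["TGFB1", "GENE2", "MISSING"]
result = {gene: index.get(gene, []) for gene in genes}
print(result)
```

With the real file you would call `build_index(map_dict["rows"])` instead of passing `sample_rows`. Returning a list per gene keeps every accession when the same gene name appears in several rows.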

Upvotes: 1
