Cindy Almighty

Reputation: 933

Finding annotation data in GFF format for NCBI Nucleotide records using Entrez

I am working with bacterial sequences from the NCBI Nucleotide database. If I have an accession, e.g. NC_002663, and I need its annotations in GFF format, how can I get them easily using Entrez (preferably via Biopython)?

If I go to the NCBI entry, I see a link to the assembly. Is there an easy way to access it programmatically? The ESummary service doesn't return such links:

from Bio import Entrez
handle = Entrez.esummary(db='nucleotide', id='NC_002663')
record = Entrez.read(handle)

[DictElement({'Item': [], 'Id': '15601865', 'Caption': 'NC_002663', 'Title': 'Pasteurella multocida subsp. multocida str. Pm70, complete genome', 'Extra': 'gi|15601865|ref|NC_002663.1|[15601865]', 'Gi': IntegerElement(15601865, attributes={}), 'CreateDate': '2001/09/10', 'UpdateDate': '2018/01/11', 'Flags': IntegerElement(800, attributes={}), 'TaxId': IntegerElement(272843, attributes={}), 'Length': IntegerElement(2257487, attributes={}), 'Status': 'live', 'ReplacedBy': '', 'Comment': '  ', 'AccessionVersion': 'NC_002663.1'}, attributes={})]

I could perhaps search the Assembly database using the "Title", but there may be a better way that needs fewer API calls. Thanks!

Upvotes: 0

Views: 279

Answers (1)

Oka

Reputation: 1328

I am not sure whether NCBI Nucleotide allows GFF download programmatically (via the `efetch` function) yet. You can access FASTA or GenBank files that way, but GFF was not listed among the available formats.

You can:

  • download it manually from their webpage (if you have only a few files to download)
  • fetch the GenBank file with the Entrez.efetch function, and convert it to GFF
  • download it with a file retrieval tool (such as wget).
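
The second option could look roughly like the sketch below. It is only an illustration, not an official converter: the helper names `gff3_line` and `genbank_to_gff3` are made up for this example, only three qualifiers (locus_tag, gene, product) are carried over, joined locations are collapsed to their outer bounds, and GenBank feature types are passed through unchanged even though GFF3 expects Sequence Ontology terms. It assumes Biopython is installed and that `rettype="gbwithparts"` returns the full annotated record.

```python
from urllib.parse import quote


def gff3_line(seqid, ftype, start, end, strand, attributes,
              source="GenBank", phase="."):
    """Format one GFF3 row from 0-based, end-exclusive coordinates."""
    strand_symbol = {1: "+", -1: "-"}.get(strand, ".")
    # Column 9: key=value pairs joined by ';', reserved characters
    # such as ';', '=' and ',' are percent-encoded.
    attr_text = ";".join(
        f"{key}={quote(str(value), safe=' :/')}"
        for key, value in attributes.items()
    )
    columns = [
        seqid, source, ftype,
        str(start + 1),   # GFF3 is 1-based, inclusive
        str(end),
        ".",              # score: not available from GenBank
        strand_symbol,
        phase,
        attr_text or ".",
    ]
    return "\t".join(columns)


def genbank_to_gff3(accession, email):
    """Fetch one GenBank record via Entrez.efetch and yield GFF3 lines."""
    from Bio import Entrez, SeqIO  # Biopython, imported only when needed

    Entrez.email = email  # NCBI asks callers to identify themselves
    handle = Entrez.efetch(db="nucleotide", id=accession,
                           rettype="gbwithparts", retmode="text")
    record = SeqIO.read(handle, "genbank")
    handle.close()

    yield "##gff-version 3"
    for feature in record.features:
        if feature.location is None:
            continue  # skip features whose location could not be parsed
        # Keep the first value of a few common qualifiers.
        attrs = {
            key: feature.qualifiers[key][0]
            for key in ("locus_tag", "gene", "product")
            if key in feature.qualifiers
        }
        phase = "."
        if feature.type == "CDS":
            # codon_start is 1, 2 or 3; GFF3 phase is 0, 1 or 2
            phase = str(int(feature.qualifiers.get("codon_start", ["1"])[0]) - 1)
        yield gff3_line(record.id, feature.type,
                        int(feature.location.start),
                        int(feature.location.end),
                        feature.location.strand, attrs, phase=phase)
```

Something like `for line in genbank_to_gff3("NC_002663", "you@example.org"): print(line)` would then print the rows (the e-mail address is a placeholder). For anything beyond a quick look, a dedicated converter is probably the safer choice.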

Also, there is a biomart package. Its R implementation mentions a function, getGFF, which can query several databases (though not the Nucleotide database). You could check whether its Python implementation offers the same functionality and whether the same files are available from there.

Upvotes: 1
