Reputation: 1511
I want to download in fasta format all the peptide sequences in the NCBI protein database (i.e. > and the peptide name, followed by the peptide sequence), I saw there is a MESH term describing what a peptide is here, but I can't work out how to incorporate it.
I wrote this:
import Bio
from Bio import Entrez
Entrez.email = '[email protected]'
handle = Entrez.esearch(db="protein", term="peptide")
record = handle.read()
out_handle = open('myfasta.fasta', 'w')
out_handle.write(record.rstrip('\n'))
but it only prints out 995 IDs, no sequences to file, I'm wondering if someone could demonstrate where I'm going wrong.
Upvotes: 1
Views: 995
Reputation: 22847
Note that a search for the term peptide in the NCBI protein database returns 8187908 hits, so make sure that you actually want to download the peptide sequences for all these hits into one big fasta file.
>>> from Bio import Entrez
>>> Entrez.email = '[email protected]'
>>> handle = Entrez.esearch(db="protein", term="peptide")
>>> record = Entrez.read(handle)
>>> record["Count"]
'8187908'
The default number of records that Entrez.esearch returns is 20. This is to prevent overloading NCBI's servers.
>>> len(record["IdList"])
20
To get the full list of records, change the retmax
parameter:
>>> count = record["Count"]
>>> handle = Entrez.esearch(db="protein", term="peptide", retmax=count)
>>> record = Entrez.read(handle)
>>> len(record['IdList'])
8187908
The way to download all the records is to use Entrez.epost
From chapter 9.4 of the BioPython tutorial:
EPost uploads a list of UIs for use in subsequent search strategies; see the EPost help page for more information. It is available from Biopython through the Bio.Entrez.epost() function.
To give an example of when this is useful, suppose you have a long list of IDs you want to download using EFetch (maybe sequences, maybe citations – anything). When you make a request with EFetch your list of IDs, the database etc, are all turned into a long URL sent to the server. If your list of IDs is long, this URL gets long, and long URLs can break (e.g. some proxies don’t cope well).
Instead, you can break this up into two steps, first uploading the list of IDs using EPost (this uses an “HTML post” internally, rather than an “HTML get”, getting round the long URL problem). With the history support, you can then refer to this long list of IDs, and download the associated data with EFetch.
[...] The returned XML includes two important strings, QueryKey and WebEnv which together define your history session. You would extract these values for use with another Entrez call such as EFetch.
Read [chapter 9.15.: Searching for and downloading sequences using the history][3] to learn how to use QueryKey
and WebEnv
A full working example would then be:
from Bio import Entrez
import time
from urllib.error import HTTPError
DB = "protein"
QUERY = "peptide"
Entrez.email = '[email protected]'
handle = Entrez.esearch(db=DB, term=QUERY, rettype='fasta')
record = Entrez.read(handle)
count = record['Count']
handle = Entrez.esearch(db=DB, term=QUERY, retmax=count, rettype='fasta')
record = Entrez.read(handle)
id_list = record['IdList']
post_xml = Entrez.epost(DB, id=",".join(id_list))
search_results = Entrez.read(post_xml)
webenv = search_results['WebEnv']
query_key = search_results['QueryKey']
batch_size = 200
with open('peptides.fasta', 'w') as out_handle:
for start in range(0, count, batch_size):
end = min(count, start+batch_size)
print(f"Going to download record {start+1} to {end}")
attempt = 0
success = False
while attempt < 3 and not success:
attempt += 1
try:
fetch_handle = Entrez.efetch(db=DB, rettype='fasta',
retstart=start, retmax=batch_size,
webenv=webenv, query_key=query_key)
success = True
except HTTPError as err:
if 500 <= err.code <= 599:
print(f"Received error from server {err}")
print(f"Attempt {attempt} of 3")
time.sleep(15)
else:
raise
data = fetch_handle.read()
fetch_handle.close()
out_handle.write(data)
The first few lines of peptides.fasta
then look like this:
>QGT67293.1 RepA leader peptide Tap (plasmid) [Klebsiella pneumoniae]
MLRKLQAQFLCHSLLLCNISAGSGD
>QGT67288.1 RepA leader peptide Tap (plasmid) [Klebsiella pneumoniae]
MLRKLQAQFLCHSLLLCNISAGSGD
>QGT67085.1 thr operon leader peptide [Klebsiella pneumoniae]
MNRIGMITTIITTTITTGNGAG
>QGT67083.1 leu operon leader peptide [Klebsiella pneumoniae]
MIRTARITSLLLLNACHLRGRLLGDVQR
>QGT67062.1 peptide antibiotic transporter SbmA [Klebsiella pneumoniae]
MFKSFFPKPGPFFISAFIWSMLAVIFWQAGGGDWLLRVTGASQNVAISAARFWSLNYLVFYAYYLFCVGV
FALFWFVYCPHRWQYWSILGTSLIIFVTWFLVEVGVAINAWYAPFYDLIQSALATPHKVSINQFYQEIGV
FLGIAIIAVIIGVMNNFFVSHYVFRWRTAMNEHYMAHWQHLRHIEGAAQRVQEDTMRFASTLEDMGVSFI
NAVMTLIAFLPVLVTLSEHVPDLPIVGHLPYGLVIAAIVWSLMGTGLLAVVGIKLPGLEFKNQRVEAAYR
KELVYGEDDETRATPPTVRELFRAVRRNYFRLYFHYMYFNIARILYLQVDNVFGLFLLFPSIVAGTITLG
LMTQITNVFGQVRGSFQYLISSWTTLVELMSIYKRLRSFERELDGKPLQEAIPTLR
Upvotes: 3