Reputation: 475
I need to extract part of a sequence from a FASTA file, using python (biopython, http://biopython.org/DIST/docs/tutorial/Tutorial.html)
I need to get the first 10 bases from each sequence and put them in one file, preserving the sequence info from the FASTA format. Worst comes to worst, I could just use the bases if there's no way to keep the sequence info. So here's an example:
>gi|2765658|emb|Z78533.1|CIZ78533 C.irapeanum 5.8S rRNA gene and ITS1 and ITS2 DNA
CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGGAATAAACGATCGAGTG
AATCCGGAGGACCGGTGTACTCAGCTCACCGGGGGCATTGCTCCCGTGGTGACCCTGATTTGTTGTTGGG
>gi|2765658|emb|Z78533.1|CIZ78533 C.irapeanum 5.8S rRNA gene and ITS1 and ITS2 DNA
CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGGAATAAACGATCGAGTG
AATCCGGAGGACCGGTGTACTCAGCTCACCGGGGGCATTGCTCCCGTGGTGACCCTGATTTGTTGTTGGG
>gi|2765658|emb|Z78533.1|CIZ78533 C.irapeanum 5.8S rRNA gene and ITS1 and ITS2 DNA
CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGGAATAAACGATCGAGTG
AATCCGGAGGACCGGTGTACTCAGCTCACCGGGGGCATTGCTCCCGTGGTGACCCTGATTTGTTGTTGGG
I need some way to get the first 10 bases (and then I was planning on doing it again for the last 10 bases). That tutorial site is pretty thorough but I'm new to this and since it doesn't go into this, I'm not even sure if it's possible.
Upvotes: 9
Views: 22746
Reputation: 282
The Biopython Seq object is basically an array so you can specify subsections of it and pass these into a new Seq object. Presuming that you have read these into a seqrecord (dictionary) then using the following code you can just specify the start end position.
SeqRecords[Seq][start:end].seq
This will give you the sequence object of the SeqRecord between the start and end positions, which are integers. From memory there is some funnyness about the start/end indexing, but play around to get the idea. You should also be able to specify:
SeqRecords[Seq][:end].seq
To get the sequence from the start of the SeqRecord.
For completeness - reading in the files like this:
inputSeqFile = open(filename, "rU")
SeqDict = SeqIO.to_dict(SeqIO.parse(inputSeqFile, "fasta"))
inputSeqFile.close()
Hope that helps.
Upvotes: 1
Reputation: 1538
Biopython is just perfect for these kinds of tasks. The Seq
-Object stores a sequence and info about it. Reading the fasta file format is straight forward. You can access the sequence like a simple list and, hence, access certain positions straight forward as well:
from Bio import SeqIO
with open("outfile.txt","w") as f:
for seq_record in SeqIO.parse("infile.fasta", "fasta"):
f.write(str(seq_record.id) + "\n")
f.write(str(seq_record.seq[:10]) + "\n") #first 10 base positions
f.write(str(seq_record.seq[-10:]) + "\n") #last 10 base positions
Upvotes: 9