user1784467
user1784467

Reputation: 475

Using Biopython (Python) to extract sequence from FASTA file

I need to extract part of a sequence from a FASTA file, using python (biopython, http://biopython.org/DIST/docs/tutorial/Tutorial.html)

I need to get the first 10 bases from each sequence and put them in one file, preserving the sequence info from the FASTA format. Worst comes to worst, I could just use the bases if there's no way to keep the sequence info. So here's an example:

>gi|2765658|emb|Z78533.1|CIZ78533 C.irapeanum 5.8S rRNA gene and ITS1 and ITS2 DNA
CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGGAATAAACGATCGAGTG
AATCCGGAGGACCGGTGTACTCAGCTCACCGGGGGCATTGCTCCCGTGGTGACCCTGATTTGTTGTTGGG

>gi|2765658|emb|Z78533.1|CIZ78533 C.irapeanum 5.8S rRNA gene and ITS1 and ITS2 DNA
CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGGAATAAACGATCGAGTG
AATCCGGAGGACCGGTGTACTCAGCTCACCGGGGGCATTGCTCCCGTGGTGACCCTGATTTGTTGTTGGG

>gi|2765658|emb|Z78533.1|CIZ78533 C.irapeanum 5.8S rRNA gene and ITS1 and ITS2 DNA
CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGGAATAAACGATCGAGTG
AATCCGGAGGACCGGTGTACTCAGCTCACCGGGGGCATTGCTCCCGTGGTGACCCTGATTTGTTGTTGGG

I need some way to get the first 10 bases (and then I was planning on doing it again for the last 10 bases). That tutorial site is pretty thorough but I'm new to this and since it doesn't go into this, I'm not even sure if it's possible.

Upvotes: 9

Views: 22746

Answers (2)

niallhaslam
niallhaslam

Reputation: 282

The Biopython Seq object is basically an array so you can specify subsections of it and pass these into a new Seq object. Presuming that you have read these into a seqrecord (dictionary) then using the following code you can just specify the start end position.

SeqRecords[Seq][start:end].seq

This will give you the sequence object of the SeqRecord between the start and end positions, which are integers. From memory there is some funnyness about the start/end indexing, but play around to get the idea. You should also be able to specify:

SeqRecords[Seq][:end].seq

To get the sequence from the start of the SeqRecord.

For completeness - reading in the files like this:

inputSeqFile = open(filename, "rU")
SeqDict = SeqIO.to_dict(SeqIO.parse(inputSeqFile, "fasta"))
inputSeqFile.close()

Hope that helps.

Upvotes: 1

MoRe
MoRe

Reputation: 1538

Biopython is just perfect for these kinds of tasks. The Seq-Object stores a sequence and info about it. Reading the fasta file format is straight forward. You can access the sequence like a simple list and, hence, access certain positions straight forward as well:

from Bio import SeqIO

with open("outfile.txt","w") as f:
        for seq_record in SeqIO.parse("infile.fasta", "fasta"):
                f.write(str(seq_record.id) + "\n")
                f.write(str(seq_record.seq[:10]) + "\n")  #first 10 base positions
                f.write(str(seq_record.seq[-10:]) + "\n") #last 10 base positions

Upvotes: 9

Related Questions