user2662391
user2662391

Reputation: 55

Extracting sequences from a fasta file by ID number in header

I have a fasta file with multiple sequences with headers that look like this:

>1016BSA34080.1
MTHSVRIITVTVNFLQHRFFIDYMSEIGLLDGEIEQMVSALQEQVHIVARARTLPEMKNLERDTHVIVKT
LKKQLTAFHSEVKKIADSTQRSRYEGKHQTYEAKVKDLEKELRTQIDPPPKSVSEKHMEDLMGEGGPDGS
GFKTTDQVLRAGIRIQNDA

>1038BSA81955.1
MQQQQARRRMEEPTAAAATASSTTSFAAQPLLSRSVAPQAASSPQASARLAESAGFRSAAVFGSAQAAVG
GRGRGGFGAPPGRGGFGAPPAAGFGAAPAFGAPPTLQAFSAAPAPGGFGAPPAPQGFGAPRAAGFGAPPA
PQAFSAVAPASSTAIPLDVTTYLGDTFGSAPTRGPP

The 4 digit number at the start of the header is a unique ID for the sequence.

Could you help me write a python script to extract sequences by the 4 digit ID (in a text file with one ID per line)?

I tried modifying this script (I found on this website: Extract sequences from a FASTA file based on entries in a separate file) to suit my purpose (in vain):

f2 = open('accessionids.txt','r')
f1 = open('fasta.txt','r')
f3 = open('fasta_parsed.txt','w')

AI_DICT = {}
for line in f2:
    AI_DICT[line[:-1]] = 1

skip = 0
for line in f1:
    if line[0] == '>':
        _splitline = line.split('|')
        accessorIDWithArrow = _splitline[0]
        accessorID = accessorIDWithArrow[1:-1]
        # print accessorID
        if accessorID in AI_DICT:
            f3.write(line)
            skip = 0
        else:
            skip = 1
    else:
        if not skip:
            f3.write(line)

f1.close()
f2.close()
f3.close()

I'm new to Python, any help will be greatly appreciated! Thanks -Divya

Upvotes: 0

Views: 6174

Answers (2)

TheRibosome
TheRibosome

Reputation: 341

Using Biopython you could do it like so (requires biopyhton installed):

from Bio import SeqIO

f1 = "fasta.fa"
f2 = "accessionids.txt"
f3 = "selected_seqs.fa"
selected_seqs = list()

with open(f2, "r") as seq_ids:
    accessionids = [line.rstrip("\n") for line in seq_ids]

for seq_record in SeqIO.parse(f1, "fasta")
    header = seq_record.name # (or .id or so)
    for accession_id in accessionids:
        if accession_id == header[0:4]:
            selected_seqs.append(seq_record)


SeqIO.write(selected_seqs, f3, "fasta")

This will go through your sequence records (fasta file) and for each entry check if there is a match with an id from accessionids file.

Note:

  • seq_record can have different tags, check in which one your identifier is located.
  • If you have an id that is not at the start (and is unique to a single sequence header) you simply use if accession_id in header:
  • See biopython tutorial chapter 5 for more detail on SeqIO.

Upvotes: 1

Peter DeGlopper
Peter DeGlopper

Reputation: 37319

Does accessionids.txt contain just the four-digit codes?

If so, change accessorID to:

accessorID = accessorIDWithArrow[1:5]

Some ways to make this more Pythonic are:

Use a set instead of a dictionary for AI_DICT, use strip() rather than slicing to remove the newline, and use a generator expression to build the set

AI_SET = set((line.strip() for line in f2))

Use True and False rather than 0 and 1 for skip.

I would redo the main loop thus:

in_accession_ids = False
for line in f1:
    if line[0] == '>':
        _splitline = line.split('|')
        accessorIDWithArrow = _splitline[0]
        accessorID = accessorIDWithArrow[1:5]
        # print accessorID
        in_accession_ids = accessorID in AI_SET
    if in_accession_ids:
        f3.write(line)

I think the logic's a little more obvious this way. Also, starting with skip = 0 in the original, or in_accession_ids=True in mine, would mean that you'd print everything prior to finding the first sequence header. That might be want you want, that might not - I assumed not in my rewrite.

You may eventually want to look into the Biopython collection - it's overkill for this specific task but quite nice overall. Lots of tools for reading FASTA files and related formats, among other things.

http://biopython.org/wiki/Biopython

Upvotes: 1

Related Questions