How to find identical sequences in Python

Question

I am new to Python, and I want to know how to find identical sequences in a FASTA file. For example, I have four sequence records here. How can I find the identical sequences and return their IDs? Thank you very much!

from Bio import SeqIO
records = list(SeqIO.parse("data/dna.txt", "fasta"))
for record in records:
    print(record.id, record.seq)


seq1 GAATGCATACTGCATCGATA
seq2 CATAAAACGTCTCCATCGCT
seq3 TGCCCAAGTTGTGAAGTGTC
seq4 TGCCCAAGTTGTGAAGTGTC

Brian Cain · Accepted Answer

You can collect the IDs for each distinct sequence in a defaultdict. Any sequence that ends up with more than one ID is a duplicate:

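from collections import defaultdict
from Bio import SeqIO

# Map each sequence string to the list of record IDs that share it.
compilation = defaultdict(list)
for record in SeqIO.parse("data/dna.txt", "fasta"):
    # Key on the plain string so grouping does not depend on how Seq objects hash.
    compilation[str(record.seq)].append(record.id)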
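
The grouping above collects every ID, including those of sequences that occur only once. To return just the IDs of the identical sequences, you can keep only the groups with more than one entry. Here is a minimal sketch of that step. The helper name find_duplicate_ids and the hard-coded reads list are assumptions for illustration (the reads are the four records from the question), and the function takes plain (id, sequence) pairs so it runs without a FASTA file:

```python
from collections import defaultdict


def find_duplicate_ids(pairs):
    """Group IDs by exact sequence text and return the groups with more than one ID.

    `pairs` is any iterable of (id, sequence) tuples. This is a hypothetical
    helper for illustration, not part of Biopython.
    """
    ids_by_sequence = defaultdict(list)
    for seq_id, sequence in pairs:
        # str() lets the helper accept plain strings or Seq-like objects.
        ids_by_sequence[str(sequence)].append(seq_id)
    # Keep only the sequences shared by two or more records.
    return [ids for ids in ids_by_sequence.values() if len(ids) > 1]


# The four reads from the question, written out as (id, sequence) pairs.
reads = [
    ("seq1", "GAATGCATACTGCATCGATA"),
    ("seq2", "CATAAAACGTCTCCATCGCT"),
    ("seq3", "TGCCCAAGTTGTGAAGTGTC"),
    ("seq4", "TGCCCAAGTTGTGAAGTGTC"),
]

duplicates = find_duplicate_ids(reads)
print(duplicates)  # [['seq3', 'seq4']]
```

With Biopython you could presumably feed it the parsed records directly, for example find_duplicate_ids((r.id, str(r.seq)) for r in SeqIO.parse("data/dna.txt", "fasta")). Note that the comparison is an exact, case-sensitive string match, so "acgt" and "ACGT" would count as different sequences unless you normalize them first.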
