cong

Reputation: 23

How to find identical sequences in Python

I am new to Python and want to know how to find identical sequences in a FASTA file. For example, here I have 4 sequence records; how do I find the identical sequences and return their IDs? Thank you very much!

from Bio import SeqIO

records = list(SeqIO.parse("data/dna.txt", "fasta"))
for record in records:
    print(record.id, record.seq)


seq1 GAATGCATACTGCATCGATA
seq2 CATAAAACGTCTCCATCGCT
seq3 TGCCCAAGTTGTGAAGTGTC
seq4 TGCCCAAGTTGTGAAGTGTC

Upvotes: 1

Views: 605

Answers (2)

Brian Cain

Reputation: 14619

You can compile the list of IDs per sequence using a defaultdict, like so:

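from Bio import SeqIO
from collections import defaultdict

records = list(SeqIO.parse("data/dna.txt", "fasta"))
compilation = defaultdict(list)  # sequence string -> list of record IDs
for record in records:
    # Key on the plain string so identical sequences always share one key
    compilation[str(record.seq)].append(record.id)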
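
To get just the IDs of sequences that occur more than once, the grouping can be filtered by group size. A minimal sketch, assuming the records are already available as (id, sequence) string pairs; `duplicate_ids` is a hypothetical helper name, not part of Biopython, and the sample data is copied from the question:

```python
from collections import defaultdict


def duplicate_ids(pairs):
    """Return one list of IDs per sequence string that appears more than once."""
    groups = defaultdict(list)
    for seq_id, seq in pairs:
        groups[seq].append(seq_id)
    # Keep only the groups that contain at least two IDs.
    return [ids for ids in groups.values() if len(ids) > 1]


# Sample data copied from the question.
pairs = [
    ("seq1", "GAATGCATACTGCATCGATA"),
    ("seq2", "CATAAAACGTCTCCATCGCT"),
    ("seq3", "TGCCCAAGTTGTGAAGTGTC"),
    ("seq4", "TGCCCAAGTTGTGAAGTGTC"),
]
print(duplicate_ids(pairs))  # [['seq3', 'seq4']]
```

With Biopython, the pairs could be built as ((r.id, str(r.seq)) for r in SeqIO.parse("data/dna.txt", "fasta")).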

Upvotes: 1

Pi Marillion

Reputation: 4674

The easiest way is with a dict.

from Bio import SeqIO

records = list(SeqIO.parse("data/dna.txt", "fasta"))
d = dict()  # sequence string -> list of records with that sequence
for record in records:
    key = str(record.seq)  # plain string key, so identical sequences group together
    if key in d:
        d[key].append(record)
    else:
        d[key] = [record]
for seq, record_set in d.items():
    print(seq + ': (' + str(len(record_set)) + ')')
    for record in record_set:
        print('    ' + record.id)

Prints like:

GAATGCATACTGCATCGATA: (1)
    seq1
CATAAAACGTCTCCATCGCT: (1)
    seq2
TGCCCAAGTTGTGAAGTGTC: (2)
    seq3
    seq4
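
If Biopython is not available, the same dict grouping can be fed from a small hand-rolled FASTA reader. A rough sketch using only the standard library; `read_fasta` and the inline sample text are made up for illustration, and it assumes plain FASTA where each header line starts with '>' and a sequence may span several lines:

```python
import io


def read_fasta(handle):
    """Yield (id, sequence) pairs from a FASTA-formatted text handle."""
    seq_id, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if seq_id is not None:
                yield seq_id, "".join(chunks)
            # The ID is the first whitespace-delimited token after '>'.
            header = line[1:].split()
            seq_id, chunks = (header[0] if header else ""), []
        else:
            chunks.append(line)
    if seq_id is not None:
        yield seq_id, "".join(chunks)


# Hypothetical sample input; seq1 is deliberately wrapped over two lines.
fasta_text = (
    ">seq1\nGAATGCATAC\nTGCATCGATA\n"
    ">seq2\nCATAAAACGTCTCCATCGCT\n"
    ">seq3\nTGCCCAAGTTGTGAAGTGTC\n"
    ">seq4\nTGCCCAAGTTGTGAAGTGTC\n"
)

groups = {}
for seq_id, seq in read_fasta(io.StringIO(fasta_text)):
    groups.setdefault(seq, []).append(seq_id)

for seq, ids in groups.items():
    print(seq, ids)
```

For a real file, pass an open file object instead of the io.StringIO wrapper.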

Upvotes: 0
