Reputation:
I have a FASTA file with an alignment of multiple gene samples. I am trying to develop a program that can count the number of mutations for each sample. What's the best way to do this? Store each gene sample in a dictionary and compare them somehow?
Upvotes: 1
Views: 2439
Reputation: 219
Try reading in the FASTA file and storing each sequence as a string. You can certainly organize the sequences in a dictionary, using the text on the '>' header line as the key. If a gene is the same length as an unmutated reference sequence, [i for i, a in enumerate(gene) if a != reference[i]] will return a list of the mutated positions, and its length will be the number of mutations. If a mutation involves missing or added amino acids, it will be much more complicated.
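To make the approach above concrete, here is a minimal sketch using only the standard library. The function names parse_fasta and mismatch_positions and the example records "ref" and "sample1" are made up for illustration, and the sketch assumes every sequence has the same length as the reference.

```python
def parse_fasta(lines):
    """Return {header text: sequence} from an iterable of FASTA lines."""
    records = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the text after '>' as the dictionary key.
            name = line[1:].strip()
            records[name] = []
        elif name is not None:
            # Sequences may be wrapped over several lines.
            records[name].append(line)
    return {key: "".join(parts) for key, parts in records.items()}


def mismatch_positions(gene, reference):
    """List the 0-based positions where gene differs from reference."""
    if len(gene) != len(reference):
        raise ValueError("sequences must have the same length")
    return [i for i, (a, b) in enumerate(zip(gene, reference)) if a != b]


# Hypothetical example data, not taken from a real file.
example = """>ref
ACTGGTTG
TCCA
>sample1
ACTGATTG
TCCT
""".splitlines()

seqs = parse_fasta(example)
positions = mismatch_positions(seqs["sample1"], seqs["ref"])
print(positions, len(positions))
```

For a real file, the same function can be fed an open file handle, for example parse_fasta(open(path)) with path pointing at your FASTA file.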
Upvotes: 0
Reputation: 1518
If they are already in an alignment format, the matching and mismatching positions are already lined up column by column. So you have something like this:
Aln1: ACTGGTTGTCCAACCGTAATCGAAG
Aln2: ---GGTTGTCCAATTC---TCGAAG
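Capture each one into a string, and simply iterate over them in parallel. Something simple like this works:
    aln1 = "ACTGGTTGTCCAACCGTAATCGAAG"
    aln2 = "---GGTTGTCCAATTC---TCGAAG"
    mutations = 0
    for i, j in zip(aln1, aln2):
        # Count a column only if the bases differ and neither is a gap.
        if i != j and i != '-' and j != '-':
            mutations += 1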
It depends on your own criteria, though, such as whether you want to count gaps as mutations.
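As a rough sketch of how that choice could be made explicit, the loop above can be wrapped in a function with a switch for gaps. The name count_mutations and the count_gaps parameter are assumptions for illustration, and each gap column is counted as one mutation here, which is only one possible convention.

```python
def count_mutations(seq_a, seq_b, count_gaps=False):
    """Count differing columns between two aligned sequences.

    With count_gaps=False, columns containing '-' in either
    sequence are skipped. With count_gaps=True, each such column
    that differs counts as one mutation.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    mutations = 0
    for a, b in zip(seq_a, seq_b):
        if a == b:
            continue
        if (a == "-" or b == "-") and not count_gaps:
            continue
        mutations += 1
    return mutations


aln1 = "ACTGGTTGTCCAACCGTAATCGAAG"
aln2 = "---GGTTGTCCAATTC---TCGAAG"

substitutions_only = count_mutations(aln1, aln2)
with_gaps = count_mutations(aln1, aln2, count_gaps=True)
print(substitutions_only, with_gaps)
```

To get a count per sample, the same function can be called once for each sequence in the alignment against whichever sequence you treat as the reference.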
Upvotes: 1