Reputation: 696
I have an 8bp string testStr
ACTGACTG
I want to compare it against a green list of 8bp strings greenList
.
GGCGCATG
ACTGAAAT
ATGCCCGT
ACTGAGTG
If testStr
is within hamming distance 1 (has a difference at <= 1 position) of any string in greenList
I want a for loop to proceed. Below is an acceptable match because the sequences differ at only one position.
ACTGACTG
||||| ||
ACTGAGTG
My first attempt to do this centered around creating a green list containing all possible hamming distance 1 variations for the strings in greenList... For example the sequence GGCGCATG
yields the following hamming distance 1 variations.
AGCGCATG CGCGCATG TGCGCATG GACGCATG GCCGCATG GTCGCATG GGAGCATG GGTGCATG GGGGCATG GGCACATG GGCCCATG GGCTCATG GGCGAATG GGCGTATG GGCGGATG GGCGCCTG GGCGCTTG GGCGCGTG GGCGCCAG GGCGCCCG GGCGCCGG GGCGCCTA GGCGCCTC GGCGCCTT
If any of the above sequences directly match "testStr" the loop would proceed.
But there must be a better way to do this... I also know there are a wealth of alignment algorithms available for DNA alignments. However, most that i have found seem like overkill for this simple situation. Any guidance much appreciated.
Upvotes: 0
Views: 397
Reputation: 40648
You don't have to generate all variations of your green
list. Instead you can calculate the actual hamming distance between your input sequence x
and each sequence in green
.
I was about to provide a solution which encoded the string sequences into lists of integers but it will work perfectly with strings as is, since strings are iterable!
Therefore, we only need to implement a hamming distance function, which will count the number of dissimilar characters between two given strings. This will do:
def hamming(s1, s2):
return sum(c1 != c2 for c1, c2 in zip(s1, s2))
Then we can iterate over green
and only keep elements with a distance to x
<= 1
:
x = 'ACTGACTG'
green = ['GGCGCATG', 'ACTGAAAT', 'ATGCCCGT', 'ACTGAGTG']
res = [s for s in green if hamming(s, x) <= 1]
Thank you @Daniel for the correction!
Upvotes: 3