Paul
Paul

Reputation: 696

Fast Python Short DNA String Comparison

I have an 8bp string testStr

ACTGACTG

I want to compare it against a green list of 8bp strings greenList.

GGCGCATG ACTGAAAT ATGCCCGT ACTGAGTG

If testStr is within hamming distance 1 (has a difference at <= 1 position) of any string in greenList I want a for loop to proceed. Below is an acceptable match because the sequences differ at only one position.

ACTGACTG

||||| ||

ACTGAGTG

My first attempt to do this centered around creating a green list containing all possible hamming distance 1 variations for the strings in greenList... For example the sequence GGCGCATG yields the following hamming distance 1 variations.

AGCGCATG CGCGCATG TGCGCATG GACGCATG GCCGCATG GTCGCATG GGAGCATG GGTGCATG GGGGCATG GGCACATG GGCCCATG GGCTCATG GGCGAATG GGCGTATG GGCGGATG GGCGCCTG GGCGCTTG GGCGCGTG GGCGCCAG GGCGCCCG GGCGCCGG GGCGCCTA GGCGCCTC GGCGCCTT

If any of the above sequences directly match "testStr" the loop would proceed.

But there must be a better way to do this... I also know there are a wealth of alignment algorithms available for DNA alignments. However, most that i have found seem like overkill for this simple situation. Any guidance much appreciated.

Upvotes: 0

Views: 397

Answers (1)

Ivan
Ivan

Reputation: 40648

You don't have to generate all variations of your green list. Instead you can calculate the actual hamming distance between your input sequence x and each sequence in green.

I was about to provide a solution which encoded the string sequences into lists of integers but it will work perfectly with strings as is, since strings are iterable!

Therefore, we only need to implement a hamming distance function, which will count the number of dissimilar characters between two given strings. This will do:

def hamming(s1, s2):
    return sum(c1 != c2 for c1, c2 in zip(s1, s2))

Then we can iterate over green and only keep elements with a distance to x <= 1:

x = 'ACTGACTG'
green = ['GGCGCATG', 'ACTGAAAT', 'ATGCCCGT', 'ACTGAGTG']

res = [s for s in green if hamming(s, x) <= 1]

Thank you @Daniel for the correction!

Upvotes: 3

Related Questions