MacPat

Reputation: 9

Finding percent string similarity

I'd like to count the matching positions between two strings so that I can calculate their percent similarity. I'd like to do this without downloading anything, as that has given me trouble: I tried downloading the biopython zip file and couldn't figure out how to set it up. I'd like something I can turn into a function that is easy to reuse for different sequences. It doesn't need to handle more than, say, 50 characters per string.

The sequences that I'm trying to compare are:

virX = 'TTTTCTTATTGT'
virZ = 'GTGGCAGACGGT'
virY = 'CTTCCTCACCGA'
virU = 'ATTACCAAAAGA'

The outputs I'm looking for are: 1) the percent similarity between each pair of sequences, and 2) the two sequences with the highest similarity.

This worked, but it is time-consuming to adapt to other sequences:

dnaA = 'ATATGCC'
dnaB = 'AAAGCGC'

count = 0
if dnaA[0] == dnaB[0]:
    count +=1
if dnaA[1] == dnaB[1]:
    count +=1
if dnaA[2] == dnaB[2]:
    count +=1
if dnaA[3] == dnaB[3]:
    count +=1
if dnaA[4] == dnaB[4]:
    count +=1
if dnaA[5] == dnaB[5]:
    count +=1
if dnaA[6] == dnaB[6]:
    count +=1

print(count, (count / len(dnaA) * 100), '%')

I tried this, which didn't work:

count = 0
for i in dnaA:
    if i == dnaB[i]:
        count += 1

I tried this:

from itertools import izip
def hamming_distance(str1, str2):
    assert len(str1) == len(str2)
    return sum(chr1 != chr2 for chr1, chr2 in izip(str1, str2))

print(hamming_distance(dnaA, dnaB))

which returned the error:

Traceback (most recent call last):
  File "C:/Users/mac03/AppData/Local/Programs/Python/Python37/Wk5FriLab.py", line 79, in <module>
    from itertools import izip
ImportError: cannot import name 'izip' from 'itertools' (unknown location)

I attempted to change izip to zip, but this didn't work either. I also tried this function in a Jupyter notebook and got the error:

ImportError Traceback (most recent call last)
      5
      6
----> 7 from itertools import zip
      8 def hamming_distance(str1, str2):
      9     assert len(str1) == len(str2)

ImportError: cannot import name 'zip' from 'itertools' (unknown location)

I tried these commands and received errors as well:

python -m ensurepip

" File "", line 6 python -m ensurepip I'm ^ SyntaxError: invalid syntax"

pip install pip --upgrade

" File "", line 7 pip install pip --upgrade ^ SyntaxError: invalid syntax"

pip install biopython

" File "", line 7 pip install biopython ^ SyntaxError: invalid syntax"

Upvotes: 0

Views: 947

Answers (2)

PMende

Reputation: 5460

dnaA = 'ATATGCC'
dnaB = 'AAAGCGC'
matches = [
    nucl_A == nucl_B
    for nucl_A, nucl_B in zip(dnaA, dnaB)
]
similarity = sum(matches)/len(matches)
similarity

Result: 0.42857142857142855

As a function:

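def hamming_dist(gene_a, gene_b):
    # Despite the name, this returns the fraction of matching positions
    # (a similarity between 0 and 1), not the Hamming distance.
    # zip stops at the shorter string, so pass sequences of equal length.
    matches = [
        nucl_a == nucl_b
        for nucl_a, nucl_b in zip(gene_a, gene_b)
    ]
    return sum(matches)/len(matches)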
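
To get the two outputs asked for in the question, the same zip-based comparison could be applied to every pair of sequences. Note that `zip` is a built-in in Python 3, so nothing needs to be imported for it (`izip` only existed in Python 2's `itertools`). The following is a minimal sketch, assuming all sequences have the same length; the names `percent_similarity`, `sequences`, `scores`, and `best_pair` are illustrative and not part of the answer above:

```python
from itertools import combinations  # standard library, nothing to install

def percent_similarity(seq_a, seq_b):
    """Percentage of positions at which two equal-length strings match."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    if not seq_a:
        return 0.0  # avoid dividing by zero for empty strings
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return 100 * matches / len(seq_a)

# The sequences from the question, keyed by name.
sequences = {
    'virX': 'TTTTCTTATTGT',
    'virZ': 'GTGGCAGACGGT',
    'virY': 'CTTCCTCACCGA',
    'virU': 'ATTACCAAAAGA',
}

# Score every unordered pair of names exactly once.
scores = {
    (name_a, name_b): percent_similarity(sequences[name_a], sequences[name_b])
    for name_a, name_b in combinations(sequences, 2)
}
for (name_a, name_b), score in scores.items():
    print(name_a, name_b, round(score, 1), '%')

# Pair with the highest score; on a tie, max keeps the first pair it saw.
best_pair = max(scores, key=scores.get)
print('Most similar:', best_pair, round(scores[best_pair], 1), '%')
```

If several pairs share the top score, `max` reports only the first one, so it may be worth printing every pair whose score equals `scores[best_pair]`.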

Upvotes: 0

pooya

Reputation: 153

Try this to calculate "count" (I assumed the strings have equal length):

dnaA = 'ATATGCC'
dnaB = 'AAAGCGC'
count = 0
indexB = 0  # current position in dnaB
for i in dnaA:
    if i == dnaB[indexB]:
        count += 1
    indexB += 1
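
The manual index in the loop above can also be left to `enumerate`, which yields each position together with its character. In the question's failing loop, `i` was the character itself (a string), so `dnaB[i]` raised a TypeError because string indices must be integers. A minimal sketch, again assuming equal-length strings, with `count_matches` as an illustrative name:

```python
def count_matches(seq_a, seq_b):
    """Count positions where two equal-length strings have the same character."""
    count = 0
    for index, char in enumerate(seq_a):
        # index is an integer position, char is the character at that position
        if char == seq_b[index]:
            count += 1
    return count

dnaA = 'ATATGCC'
dnaB = 'AAAGCGC'
count = count_matches(dnaA, dnaB)
print(count, count / len(dnaA) * 100, '%')
```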

Upvotes: 0
