Iterate and compare all values within keys in a dict. Python

Question

I have a dictionary such as:

dict = {'group1':["-----AGAAC--C--","-----ATATA-----"],'group2':["------AGGCGGA----","--AAACC----------","--AAACCG---------"]}

In this dictionary, I would like to compare all the values within each key with each other by iterating over them. The idea is to compare each pair of values and count the positions where a letter is aligned with a non-alphabetic character.

Here are the results I should get (the number of letters aligned with a non-alphabetic character, marked by a pipe in the example below, divided by the length of the string):

group1:

element 1 vs element 2

-----AGAAC--C--
-----*****--|--
-----ATATA-----
1/15 = 0.07

group2:

element 1 vs element 2

------AGGCGGA----
--||||*||||||----
--AAACC----------
10/17 = 0.59

element 2 vs element 3

--AAACC----------
--*****|---------
--AAACCG---------
1/17 = 0.059

element 1 vs element 3

------AGGCGGA----
--||||**|||||----
--AAACCG---------
9/17 = 0.53

Here is the code I use to compare them and calculate a score for group1:

value1 = "-----AGAAC--C--"
value2 = "-----ATATA-----"

count = 0
for a, b in zip(value1, value2):
    print(a.isalpha(), b.isalpha())
    # Letter in value1 aligned with a non-letter in value2
    if a.isalpha() and not b.isalpha():
        count += 1
    # Non-letter in value1 aligned with a letter in value2
    if not a.isalpha() and b.isalpha():
        count += 1

print(count / len(value1))

However, I cannot manage to do this automatically for all values. Does anyone have an idea? Thank you for your help.

javidcf · Accepted Answer

Here is a way to do that:

from itertools import combinations

# Input data (named `groups` to avoid shadowing the built-in `dict`)
groups = {
    'group1': ['-----AGAAC--C--', '-----ATATA-----'],
    'group2': ['------AGGCGGA----', '--AAACC----------', '--AAACCG---------']
}

# Iterate groups
for group, elements in groups.items():
    # Print group name
    print(group)
    # Iterate every pair of elements
    for element1, element2 in combinations(elements, 2):
        # Check both elements have the same length
        n = len(element1)
        if len(element2) != n:
            raise ValueError('elements must have the same length')
        # Count the number of times character types do not match
        count = sum(c1.isalpha() != c2.isalpha() for c1, c2 in zip(element1, element2))
        # Compute ratio
        ratio = count / n
        # Print result
        print(f'    * {element1} vs {element2}: {ratio:.4f} ({count}/{n})')
    print()

Output:

group1
    * -----AGAAC--C-- vs -----ATATA-----: 0.0667 (1/15)

group2
    * ------AGGCGGA---- vs --AAACC----------: 0.5882 (10/17)
    * ------AGGCGGA---- vs --AAACCG---------: 0.5294 (9/17)
    * --AAACC---------- vs --AAACCG---------: 0.0588 (1/17)
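
The marker line shown in the question (`*` where both characters are letters, `|` where exactly one is) is not produced by the code above. A minimal sketch of how it could be built follows; the helper name `marker_line` is hypothetical and not part of the original answer.

```python
def marker_line(s1, s2):
    """Return the alignment line used in the question: '*' where both
    characters are letters, '|' where exactly one is, '-' otherwise."""
    if len(s1) != len(s2):
        raise ValueError('strings must have the same length')
    marks = []
    for c1, c2 in zip(s1, s2):
        a1, a2 = c1.isalpha(), c2.isalpha()
        if a1 and a2:
            marks.append('*')   # letter aligned with letter
        elif a1 != a2:
            marks.append('|')   # letter aligned with non-letter (counted)
        else:
            marks.append('-')   # neither character is a letter
    return ''.join(marks)


line = marker_line('------AGGCGGA----', '--AAACC----------')
print(line)             # --||||*||||||----
print(line.count('|'))  # 10
```

Counting the pipes in the returned line gives the same `count` as the `sum(...)` expression in the answer.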

EDIT: If you want to collect the list of pairs that produce a score above some threshold, you can modify the code above slightly:

from itertools import combinations

groups = {
    'group1': ['-----AGAAC--C--', '-----ATATA-----'],
    'group2': ['------AGGCGGA----', '--AAACC----------', '--AAACCG---------']
}
threshold = 0.10

interesting_pairs = []
for group, elements in groups.items():
    for element1, element2 in combinations(elements, 2):
        n = len(element1)
        if len(element2) != n:
            raise ValueError('elements must have the same length')
        count = sum(c1.isalpha() != c2.isalpha() for c1, c2 in zip(element1, element2))
        ratio = count / n
        if ratio > threshold:
            interesting_pairs.append((element1, element2))

print(interesting_pairs)
# [('------AGGCGGA----', '--AAACC----------'), ('------AGGCGGA----', '--AAACCG---------')]
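
The list above drops both the group name and the ratio. If those need to be kept with each pair, one possible sketch is to factor the scoring into a function and collect the results per group. The names `mismatch_ratio` and `score_groups` are hypothetical, and the sketch assumes every string is non-empty.

```python
from itertools import combinations


def mismatch_ratio(s1, s2):
    """Fraction of positions where exactly one character is a letter."""
    if len(s1) != len(s2):
        raise ValueError('strings must have the same length')
    count = sum(c1.isalpha() != c2.isalpha() for c1, c2 in zip(s1, s2))
    return count / len(s1)  # assumes non-empty strings


def score_groups(groups):
    """Map each group name to a list of (element1, element2, ratio)."""
    return {
        name: [(a, b, mismatch_ratio(a, b)) for a, b in combinations(elements, 2)]
        for name, elements in groups.items()
    }


groups = {
    'group1': ['-----AGAAC--C--', '-----ATATA-----'],
    'group2': ['------AGGCGGA----', '--AAACC----------', '--AAACCG---------']
}
scores = score_groups(groups)
for name, rows in scores.items():
    for a, b, ratio in rows:
        print(f'{name}: {a} vs {b}: {ratio:.4f}')
```

Filtering by threshold then becomes a comprehension over `scores[name]`, and the ratios remain available for later use.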

EDIT 2: Following the discussion in the comments, here is another variation that groups together elements with a ratio over some threshold, transitively. This is really a different problem in its own right, namely finding the connected components of the graph defined by that relationship. You can do that with, for example, a depth-first search:

from itertools import combinations

groups = {
    'group1': ['-----AGAAC--C--', '-----ATATA-----'],
    'group2': ['------AGGCGGA----', '--AAACC----------', '--AAACCG---------']
}
threshold = 0.10

# Find connected elements
connected = {}
for group, elements in groups.items():
    for element1, element2 in combinations(elements, 2):
        n = len(element1)
        if len(element2) != n:
            raise ValueError('elements must have the same length')
        count = sum(c1.isalpha() != c2.isalpha() for c1, c2 in zip(element1, element2))
        ratio = count / n
        if ratio > threshold:
            connected.setdefault(element1, {element1}).add(element2)
            connected.setdefault(element2, {element2}).add(element1)

# Search components with DFS
result = []
visited = set()
for elem, conn in connected.items():
    if elem in visited:
        continue
    visited.add(elem)
    conn = set(conn)
    pending = list(conn)
    while pending:
        subelem = pending.pop()
        if subelem in visited:
            continue
        visited.add(subelem)
        subconn = connected[subelem]
        conn.update(subconn)
        pending.extend(subconn)
    result.append(conn)
print(result)
# [{'------AGGCGGA----', '--AAACCG---------', '--AAACC----------'}] (order within the set may vary)
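
Note that the code above only reports elements that have at least one pair over the threshold, so the two `group1` strings (ratio 0.0667) do not appear in `result`. If unlinked elements should be reported as single-element components, a possible variation is sketched below. It is run once per group, and the function name `components` is hypothetical.

```python
from itertools import combinations


def components(elements, threshold):
    """Connected components of one group; unlinked elements form
    single-element components."""
    # Start with every element so that unlinked ones are not lost
    neighbours = {e: set() for e in elements}
    for a, b in combinations(elements, 2):
        if len(a) != len(b):
            raise ValueError('elements must have the same length')
        count = sum(c1.isalpha() != c2.isalpha() for c1, c2 in zip(a, b))
        if count / len(a) > threshold:
            neighbours[a].add(b)
            neighbours[b].add(a)

    # Depth-first search from every element not yet visited
    result, visited = [], set()
    for start in elements:
        if start in visited:
            continue
        component, pending = set(), [start]
        while pending:
            elem = pending.pop()
            if elem in visited:
                continue
            visited.add(elem)
            component.add(elem)
            pending.extend(neighbours[elem] - visited)
        result.append(component)
    return result


groups = {
    'group1': ['-----AGAAC--C--', '-----ATATA-----'],
    'group2': ['------AGGCGGA----', '--AAACC----------', '--AAACCG---------']
}
for name, elements in groups.items():
    print(name, components(elements, 0.10))
```

With a threshold of 0.10, `group1` yields two single-element components and `group2` yields one component containing all three strings.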
