Reputation: 879
I have a dictionary such as:
dict = {'group1':["-----AGAAC--C--","-----ATATA-----"],'group2':["------AGGCGGA----","--AAACC----------","--AAACCG---------"]}
In this dictionary, I would like to compare all the values within each group with each other by iterating over them. The idea is to compare each pair of values and count the positions where a letter in one string faces a non-alphabetic character in the other.
Here are the results I should get (the number of letters facing a non-alphabetic character, marked by a pipe in the examples below, divided by the length of the string):
group1, element 1 vs element 2
-----AGAAC--C--
-----*****--|--
-----ATATA-----
1/15 = 0.07
group2, element 1 vs element 2
------AGGCGGA----
--||||*||||||----
--AAACC----------
10/17 = 0.59
element 2 vs element 3
--AAACC----------
--*****|---------
--AAACCG---------
1/17 = 0.059
element 1 vs element 3
------AGGCGGA----
--||||**|||||----
--AAACCG---------
9/17 = 0.53
Here is the code I use to compare two values and calculate the score for group1:
value1 = "-----AGAAC--C--"
value2 = "-----ATATA-----"
count = 0
for a, b in zip(value1, value2):
    print(a.isalpha(), b.isalpha())
    if a.isalpha() == True and b.isalpha() == False:
        count += 1
    if a.isalpha() == False and b.isalpha() == True:
        count += 1
print(count / len(value1))
But I cannot manage to do this automatically for all the values. Does anyone have an idea? Thank you for your help.
Upvotes: 0
Views: 63
Reputation: 59731
Here is a way to do that:
from itertools import combinations

# Input data
dict = {
    'group1': ['-----AGAAC--C--', '-----ATATA-----'],
    'group2': ['------AGGCGGA----', '--AAACC----------', '--AAACCG---------']
}

# Iterate groups
for group, elements in dict.items():
    # Print group name
    print(group)
    # Iterate every pair of elements
    for element1, element2 in combinations(elements, 2):
        # Check both elements have the same length
        n = len(element1)
        if len(element2) != n:
            raise ValueError
        # Count the number of times character types do not match
        count = sum(c1.isalpha() != c2.isalpha() for c1, c2 in zip(element1, element2))
        # Compute ratio
        ratio = count / n
        # Print result
        print(f' * {element1} vs {element2}: {ratio:.4f} ({count}/{n})')
    print()
Output:
group1
* -----AGAAC--C-- vs -----ATATA-----: 0.0667 (1/15)
group2
* ------AGGCGGA---- vs --AAACC----------: 0.5882 (10/17)
* ------AGGCGGA---- vs --AAACCG---------: 0.5294 (9/17)
* --AAACC---------- vs --AAACCG---------: 0.0588 (1/17)
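If you also want to see where the mismatches are, as in the pipe/star lines of the question, something like the following could work. This is only a sketch: the helper names `mismatch_markers` and `mismatch_ratio` are made up for illustration, and the sketch assumes both strings are non-empty and of equal length.

```python
def mismatch_markers(seq1, seq2):
    """Build a marker line: '|' where exactly one string has a letter,
    '*' where both have a letter, '-' where neither does."""
    if len(seq1) != len(seq2):
        raise ValueError("both strings must have the same length")
    markers = []
    for c1, c2 in zip(seq1, seq2):
        if c1.isalpha() != c2.isalpha():
            markers.append("|")   # letter facing a non-letter
        elif c1.isalpha():
            markers.append("*")   # letter facing a letter
        else:
            markers.append("-")   # non-letter facing a non-letter
    return "".join(markers)


def mismatch_ratio(seq1, seq2):
    """Fraction of positions where a letter faces a non-letter."""
    return mismatch_markers(seq1, seq2).count("|") / len(seq1)


a, b = "------AGGCGGA----", "--AAACC----------"
print(a)
print(mismatch_markers(a, b))  # --||||*||||||----
print(b)
print(round(mismatch_ratio(a, b), 2))  # 0.59
```

Counting the pipes in the marker line gives the same count as the `sum(...)` expression above, so the two approaches should agree on the ratio.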
EDIT: If you want to collect the list of pairs that produce a score above some threshold, you can modify the code above slightly:
from itertools import combinations

dict = {
    'group1': ['-----AGAAC--C--', '-----ATATA-----'],
    'group2': ['------AGGCGGA----', '--AAACC----------', '--AAACCG---------']
}
threshold = 0.10

interesting_pairs = []
for group, elements in dict.items():
    for element1, element2 in combinations(elements, 2):
        n = len(element1)
        if len(element2) != n:
            raise ValueError
        count = sum(c1.isalpha() != c2.isalpha() for c1, c2 in zip(element1, element2))
        ratio = count / n
        if ratio > threshold:
            interesting_pairs.append((element1, element2))
print(interesting_pairs)
# [('------AGGCGGA----', '--AAACC----------'), ('------AGGCGGA----', '--AAACCG---------')]
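If you need the scores along with the pairs, one possible variation is to store the ratio in each tuple and sort by it. This is a sketch under my own assumptions: the function name `scored_pairs` and the variable name `groups` are made up for illustration and are not part of the code above.

```python
from itertools import combinations


def scored_pairs(groups, threshold):
    """Return (ratio, element1, element2) tuples with ratio > threshold,
    sorted from the highest ratio to the lowest."""
    scored = []
    for elements in groups.values():
        for e1, e2 in combinations(elements, 2):
            if len(e1) != len(e2):
                raise ValueError("both strings must have the same length")
            # Positions where a letter faces a non-letter
            count = sum(c1.isalpha() != c2.isalpha() for c1, c2 in zip(e1, e2))
            ratio = count / len(e1)
            if ratio > threshold:
                scored.append((ratio, e1, e2))
    # Sort on the ratio only, highest first
    return sorted(scored, key=lambda item: item[0], reverse=True)


groups = {
    'group1': ['-----AGAAC--C--', '-----ATATA-----'],
    'group2': ['------AGGCGGA----', '--AAACC----------', '--AAACCG---------']
}
for ratio, e1, e2 in scored_pairs(groups, 0.10):
    print(f'{e1} vs {e2}: {ratio:.4f}')
```

With the example data and a threshold of 0.10, this should list the same two pairs as above, with the 10/17 pair first.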
EDIT 2: Following the discussion in the comments, here is yet another variation that groups together elements with a ratio over some threshold, transitively. This is actually a different problem in its own right, namely finding the connected components of the graph defined by that relationship. You can do that with, for example, a depth-first search:
from itertools import combinations

dict = {
    'group1': ['-----AGAAC--C--', '-----ATATA-----'],
    'group2': ['------AGGCGGA----', '--AAACC----------', '--AAACCG---------']
}
threshold = 0.10

# Find connected elements
connected = {}
for group, elements in dict.items():
    for element1, element2 in combinations(elements, 2):
        n = len(element1)
        if len(element2) != n:
            raise ValueError
        count = sum(c1.isalpha() != c2.isalpha() for c1, c2 in zip(element1, element2))
        ratio = count / n
        if ratio > threshold:
            connected.setdefault(element1, {element1}).add(element2)
            connected.setdefault(element2, {element2}).add(element1)

# Search components with DFS
result = []
visited = set()
for elem, conn in connected.items():
    if elem in visited:
        continue
    visited.add(elem)
    conn = set(conn)
    pending = list(conn)
    while pending:
        subelem = pending.pop()
        if subelem in visited:
            continue
        visited.add(subelem)
        subconn = connected[subelem]
        conn.update(subconn)
        pending.extend(subconn)
    result.append(conn)
print(result)
# [{'------AGGCGGA----', '--AAACCG---------', '--AAACC----------'}]
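As an alternative to the depth-first search, the same grouping could also be sketched with a disjoint-set (union-find) structure, which merges components as the pairs are scored. Note that this is a different technique from the DFS above, and the names `find`, `connected_groups` and `groups` are made up for illustration. Like the DFS version, it only reports elements that have at least one pair over the threshold.

```python
from itertools import combinations


def find(parent, item):
    """Return the representative of item's set, compressing the path."""
    root = item
    while parent[root] != root:
        root = parent[root]
    # Point every node on the path directly at the root
    while parent[item] != root:
        parent[item], item = root, parent[item]
    return root


def connected_groups(groups, threshold):
    """Group elements linked, directly or transitively, by a ratio > threshold."""
    parent = {}
    for elements in groups.values():
        for e1, e2 in combinations(elements, 2):
            if len(e1) != len(e2):
                raise ValueError("both strings must have the same length")
            count = sum(c1.isalpha() != c2.isalpha() for c1, c2 in zip(e1, e2))
            if count / len(e1) > threshold:
                parent.setdefault(e1, e1)
                parent.setdefault(e2, e2)
                # Union: attach the root of e1's set to the root of e2's set
                parent[find(parent, e1)] = find(parent, e2)
    # Collect the members of each set under its representative
    components = {}
    for item in parent:
        components.setdefault(find(parent, item), set()).add(item)
    return list(components.values())


groups = {
    'group1': ['-----AGAAC--C--', '-----ATATA-----'],
    'group2': ['------AGGCGGA----', '--AAACC----------', '--AAACCG---------']
}
print(connected_groups(groups, 0.10))
```

For the example data, this should give a single component containing the three strings of group2, the same as the DFS result (the order inside a set is not guaranteed).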
Upvotes: 2