Reputation: 65
Please I need help again.
I have a file named vf_to_cluster.txt
that look like:
From it I made a dictionary called vf_accession_to_cluster_groups
where keys are vf_accession
(AI0...) and values are the list of cluster groups (['1','2','3'...]
).
I've done that by coding this way (I know it's not a pretty code but that what I can do right now with what I know sorry):
f = 'script_folder/vf_to_cluster.txt'
vf_accession_to_cluster_groups = {}
with open(f, 'r') as f6:
for lines in f6.readlines():
lines = lines.replace('[', '')
lines = lines.replace(']', '')
lines = lines.replace(',', '')
lines_split = lines.strip().split(' ')
vf_keys = lines_split[0]
cluster_values = lines_split[1:]
vf_accession_to_cluster_groups[vf_keys] = cluster_values
After getting this dictionary my main objectif is to see how many vf_accessions
(AI0...) share same cluster groups. So I can say for example that AI001 and AI002 share 4 cluster groups meaning that those two vf_accession
are probably the same or really close (coded by same genes).
I made this code:
for vf_1 in vf_accession_to_cluster_groups.keys():
print '-'*40
for vf_2 in (vf_accession_to_cluster_groups.keys():
res = 0
if vf_1 != vf_2:
for i in vf_accession_to_cluster_groups[vf_1]:
for j in vf_accession_to_cluster_groups[vf_2]:
if i == j :
res = res + 1
print vf_1, vf_2, res
I obtained something like that:
I managed to discard comparison like that: AI001 AI001 or AI002 AI002...
by using if vf_1 != vf_2
:
But I can't manage to not allow comparison like that: AI014 AI015 then just after, my code compares them in another way AI015 AI014 so basically, what I want is to discard that type of comparison. If compared once don't compare it again in the other way. Can anyone help me please?
Also if any bioinformaticians sees my matrix-ish do you think that I should include the size of the list of cluster to my vf_accession
comparison like doing:
dist = float(res) / len(set(vf_accession_to_cluster_groups[vf_1] + vf_accession_to_cluster_groups[vf_2]))
Thank you all for any help provided.
Upvotes: 2
Views: 75
Reputation: 8184
You may use itertools.combinations
to get each pair only once.
In the following code, I define an Accession
class to represent the information contained in a line of your file: The accession ID (key
attribute) and the clusters (clusters
attribute). The clusters are stored as sets, and this makes it easy to compute the number of clusters common between two accessions. This is implemented in the nb_common
method, simply taking the length of the intersection (&
) between the two sets. An Accession
is created from each line in the file, thanks to the code present in the __init__
method. The list of accessions is passed to the combinations
function, with 2
as second argument, because we want pairs of accessions. We loop on the resulting pairs, and use the nb_common
method to get the number of common clusters between the two accessions of the current pair.
Here I also used sys.argv[1]
to get the file as first argument on the command line.
#!/usr/bin/env python
import sys
from itertools import combinations
class Accession(object):
"""This represents an accession and the corresponding clusters."""
def __init__(self, line):
line_parts = line.strip().split(" ")
self.key = line_parts[0]
# "".join(...) re-creates a string representing the list of clusters
# [1:-1] eliminates the brackets
self.clusters = set("".join(line_parts[1:])[1:-1].split(","))
def nb_common(self, other):
return len(self.clusters & other.clusters)
with open(sys.argv[1], "r") as cluster_file:
accessions = [Accession(line) for line in cluster_file]
for (acc1, acc2) in combinations(accessions, 2):
print acc1.key, acc2.key, acc1.nb_common(acc2)
I call the script as follows:
$ ./compare_accessions.py vf_to_cluster.txt
AI001 AI002 4
AI001 AI004 4
AI001 AI005 4
AI001 AI010 0
AI001 AI011 0
AI001 AI012 4
AI001 AI013 0
AI001 AI014 0
AI001 AI015 5
AI001 AI016 0
AI001 AI017 0
AI002 AI004 4
AI002 AI005 4
AI002 AI010 0
AI002 AI011 0
AI002 AI012 4
AI002 AI013 0
AI002 AI014 0
AI002 AI015 4
AI002 AI016 0
AI002 AI017 0
AI004 AI005 4
AI004 AI010 0
AI004 AI011 0
AI004 AI012 4
AI004 AI013 0
AI004 AI014 0
AI004 AI015 5
AI004 AI016 0
AI004 AI017 0
AI005 AI010 0
AI005 AI011 0
AI005 AI012 4
AI005 AI013 0
AI005 AI014 0
AI005 AI015 4
AI005 AI016 0
AI005 AI017 0
AI010 AI011 0
AI010 AI012 0
AI010 AI013 0
AI010 AI014 1
AI010 AI015 1
AI010 AI016 0
AI010 AI017 0
AI011 AI012 0
AI011 AI013 1
AI011 AI014 0
AI011 AI015 1
AI011 AI016 0
AI011 AI017 0
AI012 AI013 0
AI012 AI014 0
AI012 AI015 4
AI012 AI016 0
AI012 AI017 0
AI013 AI014 0
AI013 AI015 1
AI013 AI016 0
AI013 AI017 0
AI014 AI015 1
AI014 AI016 0
AI014 AI017 0
AI015 AI016 1
AI015 AI017 0
AI016 AI017 0
How to turn the number of common clusters into a distance seems quite an open question. You could ask this in the bioinformatics stackexchange site, not forgetting to introduce your biological questions.
Upvotes: 1
Reputation: 31679
If you don't have millions of keys, you could just store the keys in a list and sort them (makes the results human readable).
cluster_groups = list(vf_accession_to_cluster_groups.keys())
cluster_groups.sort()
Now you can use enumerate
to loop over all the keys (except for the last one because you don't need to compare it to itself):
for index, vf_1 in enumerate(cluster_groups[:-1]):
and for the comparison loop over all the keys after the one you were just using for your outer loop
for vf_2 in cluster_groups[index + 1:]:
Complete code
cluster_groups = list(vf_accession_to_cluster_groups.keys())
cluster_groups.sort()
for index, vf_1 in enumerate(cluster_groups[:-1]):
print('-'*40)
for vf_2 in cluster_groups[index + 1:]:
res = 0
for i in vf_accession_to_cluster_groups[vf_1]:
for j in vf_accession_to_cluster_groups[vf_2]:
if i == j :
res = res + 1
print(vf_1, vf_2, res)
Some small suggestions
If you want to check if an item is in a list, just use
if item in my_list:
Updated code
cluster_groups = list(vf_accession_to_cluster_groups.keys())
cluster_groups.sort()
results = dict()
for index, vf_1 in enumerate(cluster_groups[:-1]):
print('-'*40)
results[vf_1] = dict()
for vf_2 in cluster_groups[index + 1:]:
res = 0
for i in vf_accession_to_cluster_groups[vf_1]:
if i in vf_accession_to_cluster_groups[vf_2]:
res = res + 1
print(vf_1, vf_2, res)
results[vf_1].update({vf_2: res})
def get_results(key1, key2, results):
if key1 > key2:
key1, key2 = key2, key1
if results.get(key1):
return results[key1].get(key2)
return None
Upvotes: 1