compare a dictionary to itself but avoiding the comparison of a key twice if already compared

Question

Please I need help again.

I have a file named vf_to_cluster.txt that looks like this:

From it I made a dictionary called vf_accession_to_cluster_groups, where the keys are vf_accession values (AI0...) and the values are lists of cluster groups (['1','2','3'...]).
I built it with the code below (I know it's not pretty, but it's what I can do right now with what I know, sorry):

f = 'script_folder/vf_to_cluster.txt'
vf_accession_to_cluster_groups = {}

with open(f, 'r') as f6:
    for lines in f6.readlines():
        lines = lines.replace('[', '')
        lines = lines.replace(']', '')
        lines = lines.replace(',', '')
        lines_split = lines.strip().split(' ')
        vf_keys = lines_split[0]
        cluster_values = lines_split[1:]
        vf_accession_to_cluster_groups[vf_keys] = cluster_values

Once I have this dictionary, my main objective is to see how many cluster groups each pair of vf_accessions (AI0...) shares. For example, I could then say that AI001 and AI002 share 4 cluster groups, meaning that those two vf_accessions are probably the same or very close (coded by the same genes).
I wrote this code:

for vf_1 in vf_accession_to_cluster_groups.keys():
    print '-'*40
    for vf_2 in vf_accession_to_cluster_groups.keys():
        res = 0 
        if vf_1 != vf_2:
            for i in vf_accession_to_cluster_groups[vf_1]:
                for j in vf_accession_to_cluster_groups[vf_2]:
                    if i == j : 
                        res = res + 1

            print vf_1, vf_2, res

I obtained something like this:

I managed to skip self-comparisons such as AI001 AI001 or AI002 AI002
by using if vf_1 != vf_2:

But I can't work out how to prevent the reverse comparison: the code compares AI014 AI015 and then, later on, AI015 AI014. What I want is to discard that second comparison: if a pair has been compared once, don't compare it again in the other order. Can anyone help me please?

Also, if any bioinformaticians see my matrix-like output: do you think I should take the size of the cluster lists into account when comparing vf_accessions, for example like this:

dist = float(res) / len(set(vf_accession_to_cluster_groups[vf_1] + vf_accession_to_cluster_groups[vf_2]))
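For reference, the ratio above divides the shared count by the size of the union of the two lists, which is the Jaccard index (a similarity between 0 and 1, so a distance would be 1 minus this value). A minimal sketch with sets follows; the function name jaccard_index is hypothetical, and it assumes no cluster group is repeated within one accession's list.

```python
def jaccard_index(list_1, list_2):
    """Shared items divided by the size of the union of both lists.

    Hypothetical helper; assumes each list has no repeated entries.
    """
    set_1, set_2 = set(list_1), set(list_2)
    union = set_1 | set_2
    if not union:
        # Two empty lists: define the similarity as 0.0 to avoid dividing by zero.
        return 0.0
    return len(set_1 & set_2) / float(len(union))


# Made-up cluster lists, for illustration only.
similarity = jaccard_index(['1', '2', '3', '4'], ['2', '3', '4', '5'])
print(similarity)
```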

Thank you all for any help provided.

Maximilian Peters · Accepted Answer

If you don't have millions of keys, you could just store the keys in a list and sort them (makes the results human readable).

cluster_groups = list(vf_accession_to_cluster_groups.keys())
cluster_groups.sort()

Now you can use enumerate to loop over all the keys (except for the last one, because no keys come after it to compare it with):

for index, vf_1 in enumerate(cluster_groups[:-1]):

For the comparison, loop over all the keys that come after the one currently used in the outer loop:

    for vf_2 in cluster_groups[index + 1:]:

Complete code

cluster_groups = list(vf_accession_to_cluster_groups.keys())
cluster_groups.sort()

for index, vf_1 in enumerate(cluster_groups[:-1]):
    print('-'*40)
    for vf_2 in cluster_groups[index + 1:]:
        res = 0 
        for i in vf_accession_to_cluster_groups[vf_1]:
            for j in vf_accession_to_cluster_groups[vf_2]:
                if i == j : 
                    res = res + 1

        print(vf_1, vf_2, res)
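
As a possible alternative to slicing the sorted key list by hand, the standard library's itertools.combinations yields each unordered pair exactly once. The sketch below keeps the same counting loop; the sample dictionary is made up for illustration and is not the asker's data.

```python
from itertools import combinations

# Made-up sample standing in for the real vf_accession_to_cluster_groups.
vf_accession_to_cluster_groups = {
    'AI002': ['2', '3', '4', '5'],
    'AI001': ['1', '2', '3', '4'],
    'AI003': ['7'],
}

# combinations(..., 2) never yields (a, a), and never yields both (a, b) and (b, a).
pairs = list(combinations(sorted(vf_accession_to_cluster_groups), 2))

for vf_1, vf_2 in pairs:
    res = 0
    for i in vf_accession_to_cluster_groups[vf_1]:
        for j in vf_accession_to_cluster_groups[vf_2]:
            if i == j:
                res = res + 1
    print(vf_1, vf_2, res)
```

Sorting the keys first only makes the output order predictable; combinations works on the keys in any order.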

Some small suggestions

Store the results in a dictionary so you can retrieve them later. You could use a dictionary of dictionaries.
If you want to check if an item is in a list, just use

if item in my_list:
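
If the cluster lists get long, a further option is to turn them into sets and count the intersection, since membership tests on a set do not scan the whole collection. This is only a sketch: count_shared is a hypothetical name, and the result matches the loops above only when a list holds no repeated cluster groups.

```python
def count_shared(list_1, list_2):
    """Count the distinct items that appear in both lists (hypothetical helper)."""
    return len(set(list_1) & set(list_2))


# Made-up cluster lists, for illustration only.
shared = count_shared(['1', '2', '3', '4'], ['2', '3', '4', '5'])
print(shared)
```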

Updated code

cluster_groups = list(vf_accession_to_cluster_groups.keys())
cluster_groups.sort()

results = dict()

for index, vf_1 in enumerate(cluster_groups[:-1]):
    print('-'*40)
    results[vf_1] = dict()
    for vf_2 in cluster_groups[index + 1:]:
        res = 0 
        for i in vf_accession_to_cluster_groups[vf_1]:
            if i in vf_accession_to_cluster_groups[vf_2]:
                res = res + 1

        print(vf_1, vf_2, res)
        results[vf_1][vf_2] = res


def get_results(key1, key2, results):
    if key1 > key2:
        key1, key2 = key2, key1

    if results.get(key1):
        return results[key1].get(key2)
    return None
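
Another way to make lookups independent of key order, instead of swapping the keys, would be a flat dictionary keyed by frozenset. This is a different technique from the nested dictionary above, shown only as a sketch; the names flat_results and get_pair_result are hypothetical, and the nested sample values are made up.

```python
# Made-up nested results in the same shape as the dictionary of dictionaries above.
results = {
    'AI001': {'AI002': 3, 'AI003': 0},
    'AI002': {'AI003': 0},
}

# frozenset(('a', 'b')) == frozenset(('b', 'a')), so one entry serves both orders.
flat_results = {
    frozenset((key1, key2)): res
    for key1, inner in results.items()
    for key2, res in inner.items()
}


def get_pair_result(key1, key2, flat_results):
    """Return the stored count for the pair, or None if it was never compared."""
    return flat_results.get(frozenset((key1, key2)))


print(get_pair_result('AI002', 'AI001', flat_results))
```

Asking for a key paired with itself returns None here, because a one-element frozenset was never stored.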
