Comparing key from first dictionary to values from second dictionary

Question

Please I need some help again.

I have a big database file (let's call it db.csv) containing a lot of information.

Simplified database file to illustrate:

I ran usearch61 -cluster_fast on my gene sequences in order to cluster them.
I obtained a file named 'clusters.uc'. I opened it as a CSV, then wrote code to create a dictionary (let's say dict_1) with my cluster numbers as keys and my gene_ids (VFG...) as values.
Here is an example of what I made and then stored in a file: dict_1

 0 ['VFG003386', 'VFG034084', 'VFG003381']  
 1 ['VFG000838', 'VFG000630', 'VFG035932', 'VFG000636']  
 2 ['VFG018349', 'VFG018485', 'VFG043567']  
 ...  
 14471 ['VFG015743', 'VFG002143']

So far so good. Then, using db.csv, I made another dictionary (dict_2) where gene_ids (VFG...) are keys and VF_Accessions (IA... or CVF.. or VF...) are values. Illustration: dict_2

 VFG044259 IA027
 VFG044258 IA027
 VFG011941 CVF397
 VFG012016 CVF399
 ...

What I want in the end is, for each VF_Accession, the numbers of the cluster groups it appears in. Illustration:

IA027 [0,5,6,8]
CVF399 [15, 1025, 1562, 1712]
...

Since I'm still a beginner in coding, my guess is that I need code that compares the values from dict_1 (VFG...) to the keys from dict_2 (VFG...). If they match, the VF_Accession becomes a key with all matching cluster numbers as its value. Since VF_Accessions are keys, they can't be duplicated, so I need a dictionary of lists. I think I can do that part because I did it for dict_1. My problem is that I can't figure out how to compare the values from dict_1 to the keys from dict_2 and assign the cluster numbers to each VF_Accession. Please help me.

BioGeek · Accepted Answer

First, let's give your dictionaries better names than dict_1, dict_2, ... That makes them easier to work with and makes it easier to remember what they contain.

You first created a dictionary that has cluster numbers as keys and gene_ids (VFG...) as values:

cluster_nr_to_gene_ids = {0: ['VFG003386', 'VFG034084', 'VFG003381', 'VFG044259'],
                          1: ['VFG000838', 'VFG000630', 'VFG035932', 'VFG000636'],
                          2: ['VFG018349', 'VFG018485', 'VFG043567', 'VFG012016'],
                          5: ['VFG011941'],
                          7949: ['VFG003386'],                              
                          14471: ['VFG015743', 'VFG002143', 'VFG012016']}
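
In case it helps to see how such a dictionary could be built straight from the clusters.uc file, here is a minimal sketch. It relies on the tab-separated UC layout (record type in field 1, cluster number in field 2, sequence label in field 9). The sample rows are made up, and it assumes the label is the bare gene ID, which may not match your file; the name clusters_from_uc is only illustrative.

```python
import csv
import io
from collections import defaultdict

# Hypothetical excerpt of a .uc file (tab-separated, 10 fields per row).
# Assumption: the sequence label in field 9 is the bare gene ID.
sample_uc = (
    "S\t0\t900\t*\t*\t*\t*\t*\tVFG003386\t*\n"
    "H\t0\t897\t98.5\t+\t0\t0\t897M\tVFG034084\tVFG003386\n"
    "S\t1\t450\t*\t*\t*\t*\t*\tVFG000838\t*\n"
    "C\t0\t2\t*\t*\t*\t*\t*\tVFG003386\t*\n"
    "C\t1\t1\t*\t*\t*\t*\t*\tVFG000838\t*\n"
)

clusters_from_uc = defaultdict(list)
for row in csv.reader(io.StringIO(sample_uc), delimiter="\t"):
    record_type, cluster_nr, label = row[0], int(row[1]), row[8]
    # S rows are centroids and H rows are hits assigned to a cluster.
    # C rows only summarise a cluster and would repeat the centroid,
    # so they are skipped here.
    if record_type in ("S", "H"):
        clusters_from_uc[cluster_nr].append(label)

print(dict(clusters_from_uc))
```

With a real file you would replace the io.StringIO object with an opened file handle.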

And you also have another dictionary where gene_ids are keys and VF_Accessions (IA... or CVF.. or VF...) are values:

gene_id_to_vf_accession = {'VFG044259': 'IA027',
                           'VFG044258': 'IA027',
                           'VFG011941': 'CVF397',
                           'VFG012016': 'CVF399',
                           'VFG000676': 'VF0142',
                           'VFG002231': 'VF0369',
                           'VFG003386': 'CVF051'}
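
If this mapping still has to be read from db.csv, a sketch along these lines might work. The column names gene_id and VF_Accession are assumptions, since the real header of db.csv is not shown, and the rows are just a few of the sample pairs from above; mapping_from_csv is an illustrative name.

```python
import csv
import io

# Hypothetical excerpt of db.csv. The column names "gene_id" and
# "VF_Accession" are assumptions; adjust them to the real header.
sample_db = io.StringIO(
    "gene_id,VF_Accession\n"
    "VFG044259,IA027\n"
    "VFG044258,IA027\n"
    "VFG011941,CVF397\n"
)

# DictReader yields one dict per row, keyed by the header names.
mapping_from_csv = {
    row["gene_id"]: row["VF_Accession"] for row in csv.DictReader(sample_db)
}
print(mapping_from_csv)
```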

And we want to create a dictionary where each VF_Accession key has as value the numbers of cluster groups: vf_accession_to_cluster_groups.

We also note that a VF Accession can be associated with multiple gene IDs (for example, the VF Accession IA027 has both the VFG044259 and the VFG044258 gene IDs).

So we use defaultdict to make a dictionary with the VF Accession as key and a list of gene IDs as value:

from collections import defaultdict
vf_accession_to_gene_ids = defaultdict(list)
for gene_id, vf_accession in gene_id_to_vf_accession.items():
    vf_accession_to_gene_ids[vf_accession].append(gene_id)

For the sample data I posted above, vf_accession_to_gene_ids now looks like:

defaultdict(<class 'list'>, {'VF0142': ['VFG000676'],
                             'CVF051': ['VFG003386'], 
                             'IA027':  ['VFG044258', 'VFG044259'],
                             'CVF399': ['VFG012016'], 
                             'CVF397': ['VFG011941'], 
                             'VF0369': ['VFG002231']})

Now we can loop over each VF Accession and look up its list of gene IDs. Then, for each gene ID, we loop over every cluster and see if the gene ID is present there:

vf_accession_to_cluster_groups = {}
for vf_accession in vf_accession_to_gene_ids:
    gene_ids = vf_accession_to_gene_ids[vf_accession]
    cluster_group = []
    for gene_id in gene_ids:
        for cluster_nr in cluster_nr_to_gene_ids:
            if gene_id in cluster_nr_to_gene_ids[cluster_nr]:
                cluster_group.append(cluster_nr)
    vf_accession_to_cluster_groups[vf_accession] = cluster_group

The end result for the above sample data now is:

{'VF0142': [], 
 'CVF051': [0, 7949], 
 'IA027':  [0], 
 'CVF399': [2, 14471], 
 'CVF397': [5], 
 'VF0369': []}
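
The nested loops above scan every cluster for every gene ID, which can get slow with thousands of clusters. As a possible alternative (a sketch, not the only way to do it), you could first build a reverse lookup from gene ID to cluster numbers and then collect the clusters per VF Accession. The names gene_id_to_cluster_nrs, clusters_by_accession and result are made up for this example. Note one difference: it uses sets, so a cluster number is listed only once per VF Accession, and the numbers come out sorted.

```python
from collections import defaultdict

# Same sample data as above, repeated so the snippet runs on its own.
cluster_nr_to_gene_ids = {0: ['VFG003386', 'VFG034084', 'VFG003381', 'VFG044259'],
                          1: ['VFG000838', 'VFG000630', 'VFG035932', 'VFG000636'],
                          2: ['VFG018349', 'VFG018485', 'VFG043567', 'VFG012016'],
                          5: ['VFG011941'],
                          7949: ['VFG003386'],
                          14471: ['VFG015743', 'VFG002143', 'VFG012016']}

gene_id_to_vf_accession = {'VFG044259': 'IA027',
                           'VFG044258': 'IA027',
                           'VFG011941': 'CVF397',
                           'VFG012016': 'CVF399',
                           'VFG000676': 'VF0142',
                           'VFG002231': 'VF0369',
                           'VFG003386': 'CVF051'}

# Reverse lookup: gene ID -> every cluster number that contains it.
gene_id_to_cluster_nrs = defaultdict(list)
for cluster_nr, gene_ids in cluster_nr_to_gene_ids.items():
    for gene_id in gene_ids:
        gene_id_to_cluster_nrs[gene_id].append(cluster_nr)

# Collect the cluster numbers per VF Accession; a set drops duplicates.
clusters_by_accession = defaultdict(set)
for gene_id, vf_accession in gene_id_to_vf_accession.items():
    # .get() returns an empty list for genes that are in no cluster,
    # so those accessions still appear, with an empty result.
    clusters_by_accession[vf_accession].update(
        gene_id_to_cluster_nrs.get(gene_id, []))

# Convert the sets to sorted lists for a stable, readable result.
result = {acc: sorted(nrs) for acc, nrs in clusters_by_accession.items()}
print(result)
```

For the sample data this gives the same cluster numbers per VF Accession as the nested-loop version.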
