glitterbox

Reputation: 43

For loop for a dictionary

I have a big dictionary containing information about lots of clusters and their genes. I'm trying to access part of the data about 'cdhitclusters'. This section of code works perfectly and does exactly what I want it to do (counting the number of rep_genes per cluster). I just don't know how to write a for loop to do this for all the clusters in the dictionary.

clus1 = (gene_clusters.get("cluster-1"))
cdhit1 = (clus1.get("cdhitclusters"))
rep1 = pd.DataFrame(cdhit1)
print(len(rep1.rep_gene))

Here's a section of the dictionary:

{
    'cluster-1': {
        'BGCid': '-',
        'cdhitclusters': [
            {
                'genes': { 'AT1G24070': 100.0 },
                'rep_gene': 'AT1G24070'
            },
            {
                'genes': { 'AT1G24100': 100.0 },
                'rep_gene': 'AT1G24100'
            },
            {
                'genes': {
                    'AT1G24040': 100.0,
                    'AT1G2404_1': 100.0,
                    'AT1G2404_2': 100.0
                },
                'rep_gene': 'AT1G24040'
            },
            {
                'genes': {
                    'AT1G24020': 100.0,
                    'AT1G2402_1': 100.0
                },
                'rep_gene': 'AT1G24020'
            },
            {
                'genes': { 'AT1G24010': 100.0 },
                'rep_gene': 'AT1G24010'
            },
            {
                'genes': { 'AT1G24000': 100.0 },
                'rep_gene': 'AT1G24000'
            }
        ]
    ...

There are 45 clusters. How can I write a loop that does what the code above does, but for all the clusters?

I want it to output to a dataframe that I can add to a larger data frame. This is the code I'm using, but it only calculates the CDhit for the first cluster in the loop. What am I doing wrong?

for clus in gene_clusters.values():
    cdhit = (clus.get("cdhitclusters"))
    rep = pd.DataFrame(cdhit)
    replen = rep.iloc[:,0]
    replen1 = len(rep.rep_gene)
    list = [replen1]    
    replen2 = pd.DataFrame(list, columns=['CDhits'])
    replen2 = replen2.CDhits

Upvotes: 0

Views: 116

Answers (3)

Franz Gastring

Reputation: 1130

Something like this:

# data is the full dictionary (gene_clusters in the question)
for cluster_name, cluster_data in data.items():
    print(f"Cluster Name: {cluster_name}")
    print(f"BGCid: {cluster_data['BGCid']}")

    # each entry in 'cdhitclusters' is one CD-HIT cluster
    for cdhitcluster in cluster_data['cdhitclusters']:
        print("CD-Hit Cluster:")

        # 'genes' maps each gene ID to its percentage
        for gene, percentage in cdhitcluster['genes'].items():
            print(f"Gene: {gene}, Percentage: {percentage}")

Upvotes: 0

nachtgoblin24

Reputation: 66

I think in this case an easy way would be:

for cluster in gene_clusters.keys():
   #your code

I tried this with the section of the dict you provided, and gene_clusters.keys() produced exactly what you wanted: the cluster names.
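
To show how the loop body could use each key, here is a minimal sketch. The two-cluster `gene_clusters` sample is made up for illustration (only its shape follows the question), and `rep_counts` is a hypothetical name for the result.

```python
# Hypothetical sample in the same shape as the dictionary in the question.
gene_clusters = {
    "cluster-1": {
        "BGCid": "-",
        "cdhitclusters": [
            {"genes": {"AT1G24070": 100.0}, "rep_gene": "AT1G24070"},
            {"genes": {"AT1G24100": 100.0}, "rep_gene": "AT1G24100"},
        ],
    },
    "cluster-2": {
        "BGCid": "-",
        "cdhitclusters": [
            {"genes": {"GENE_A": 100.0}, "rep_gene": "GENE_A"},
        ],
    },
}

rep_counts = {}  # cluster name -> number of rep_genes
for cluster in gene_clusters.keys():
    # Look the cluster up by its key, then read its CD-HIT entries.
    cdhit = gene_clusters[cluster].get("cdhitclusters", [])
    # Count the entries that actually carry a 'rep_gene'.
    rep_counts[cluster] = sum(1 for entry in cdhit if "rep_gene" in entry)

print(rep_counts)
```

Keeping the counts in a dictionary keyed by cluster name means you can still tell which count belongs to which cluster afterwards.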

Upvotes: 0

Mureinik

Reputation: 310993

You don't seem to use the key, so you could just iterate over the dictionary's values():

for clus in gene_clusters.values():
    cdhit = (clus.get("cdhitclusters"))
    rep = pd.DataFrame(cdhit)
    print(len(rep.rep_gene))
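
If you need the counts in a DataFrame rather than printed, one possible approach is to collect one row per cluster and build the frame once after the loop. In the loop from the question, `replen2` is reassigned on every pass, so only one cluster's count is left when the loop ends. This is only a sketch: the sample data, `rows`, and `counts_df` are assumptions for illustration, and the `CDhits` column name is taken from the question.

```python
import pandas as pd

# Hypothetical sample in the same shape as the dictionary in the question.
gene_clusters = {
    "cluster-1": {
        "BGCid": "-",
        "cdhitclusters": [
            {"genes": {"AT1G24070": 100.0}, "rep_gene": "AT1G24070"},
            {"genes": {"AT1G24100": 100.0}, "rep_gene": "AT1G24100"},
        ],
    },
    "cluster-2": {
        "BGCid": "-",
        "cdhitclusters": [
            {"genes": {"GENE_A": 100.0}, "rep_gene": "GENE_A"},
        ],
    },
}

rows = []  # one dict per cluster, appended inside the loop
for name, clus in gene_clusters.items():
    cdhit = clus.get("cdhitclusters", [])
    # Same number as len(pd.DataFrame(cdhit).rep_gene) when every entry
    # has a 'rep_gene', but it also works for an empty list.
    rows.append({"cluster": name, "CDhits": len(cdhit)})

# Build the DataFrame once, after all clusters have been visited.
counts_df = pd.DataFrame(rows, columns=["cluster", "CDhits"])
print(counts_df)
```

This uses items() instead of values() only so that the cluster name can be stored next to its count, which should make it easier to join the result onto a larger data frame.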

Upvotes: 2
