Reputation: 43
I have a big dictionary containing information about lots of clusters and their genes. I'm trying to access part of the data about 'cdhitclusters'. This section of code works perfectly and does exactly what I want it to do (counting the number of rep_genes per cluster). I just don't know how to write a for loop to do this for all the clusters in the dictionary.
clus1 = (gene_clusters.get("cluster-1"))
cdhit1 = (clus1.get("cdhitclusters"))
rep1 = pd.DataFrame(cdhit1)
print(len(rep1.rep_gene))
Here's a section of the dictionary:
{
    'cluster-1': {
        'BGCid': '-',
        'cdhitclusters': [
            {
                'genes': { 'AT1G24070': 100.0 },
                'rep_gene': 'AT1G24070'
            },
            {
                'genes': { 'AT1G24100': 100.0 },
                'rep_gene': 'AT1G24100'
            },
            {
                'genes': {
                    'AT1G24040': 100.0,
                    'AT1G2404_1': 100.0,
                    'AT1G2404_2': 100.0
                },
                'rep_gene': 'AT1G24040'
            },
            {
                'genes': {
                    'AT1G24020': 100.0,
                    'AT1G2402_1': 100.0
                },
                'rep_gene': 'AT1G24020'
            },
            {
                'genes': { 'AT1G24010': 100.0 },
                'rep_gene': 'AT1G24010'
            },
            {
                'genes': { 'AT1G24000': 100.0 },
                'rep_gene': 'AT1G24000'
            }
        ]
...
There are 45 clusters. How can I write a loop that does what the code above does, but for all the clusters?
I want it to output to a dataframe that I can add to a larger data frame. This is the code I'm using, but it only calculates the CDhit for the first cluster in the loop. What am I doing wrong?
for clus in gene_clusters.values():
    cdhit = (clus.get("cdhitclusters"))
    rep = pd.DataFrame(cdhit)
    replen = rep.iloc[:,0]
    replen1 = len(rep.rep_gene)
    list = [replen1]
    replen2 = pd.DataFrame(list, columns=['CDhits'])
    replen2 = replen2.CDhits
Upvotes: 0
Views: 116
Reputation: 1130
Something like this:
for cluster_name, cluster_data in gene_clusters.items():
    print(f"Cluster Name: {cluster_name}")
    print(f"BGCid: {cluster_data['BGCid']}")
    for cdhitcluster in cluster_data['cdhitclusters']:
        print("CD-Hit Cluster:")
        for gene, percentage in cdhitcluster['genes'].items():
            print(f"Gene: {gene}, Percentage: {percentage}")
Upvotes: 0
Reputation: 66
I think in this case an easy way would be:
for cluster in gene_clusters.keys():
    # your code
I tried this with the section of the dict you provided, and gene_clusters.keys() produced exactly what you wanted: the names of the clusters.
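To make the "# your code" part concrete, here is a minimal sketch of what the loop body could look like when iterating over the keys. The two-cluster sample dictionary (including "cluster-2" and the gene name "GENE_A") is made up for illustration and only mirrors the shape shown in the question. It also assumes every entry in "cdhitclusters" carries exactly one rep_gene, so counting the entries gives the same number as len(rep.rep_gene).

```python
# Hypothetical sample data in the same shape as the question's dictionary.
gene_clusters = {
    "cluster-1": {
        "BGCid": "-",
        "cdhitclusters": [
            {"genes": {"AT1G24070": 100.0}, "rep_gene": "AT1G24070"},
            {"genes": {"AT1G24100": 100.0}, "rep_gene": "AT1G24100"},
        ],
    },
    "cluster-2": {  # made-up cluster, for illustration only
        "BGCid": "-",
        "cdhitclusters": [
            {"genes": {"GENE_A": 100.0}, "rep_gene": "GENE_A"},
        ],
    },
}

counts = {}
for cluster in gene_clusters.keys():
    # Look the cluster up by its key, then count its CD-HIT entries.
    cdhit = gene_clusters[cluster].get("cdhitclusters", [])
    counts[cluster] = len(cdhit)

print(counts)
```

Keeping the cluster name as the dictionary key means each count stays attached to the cluster it came from.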
Upvotes: 0
Reputation: 310993
You don't seem to use the key, so you could just iterate over the dictionary's values():
for clus in gene_clusters.values():
    cdhit = clus.get("cdhitclusters")
    rep = pd.DataFrame(cdhit)
    print(len(rep.rep_gene))
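If you want the counts in a dataframe rather than printed, one possible approach is to collect one row per cluster and build the DataFrame once, after the loop. The loop in the question probably ends up with a single value because replen2 is reassigned on every iteration, so only one cluster's count survives. The sketch below uses a made-up two-cluster sample ("cluster-2" and "GENE_A" are hypothetical) and assumes each entry in "cdhitclusters" has one rep_gene, so len() of that list matches len(rep.rep_gene).

```python
import pandas as pd

# Hypothetical sample data in the same shape as the question's dictionary.
gene_clusters = {
    "cluster-1": {
        "BGCid": "-",
        "cdhitclusters": [
            {"genes": {"AT1G24070": 100.0}, "rep_gene": "AT1G24070"},
            {"genes": {"AT1G24100": 100.0}, "rep_gene": "AT1G24100"},
        ],
    },
    "cluster-2": {  # made-up cluster, for illustration only
        "BGCid": "-",
        "cdhitclusters": [
            {"genes": {"GENE_A": 100.0}, "rep_gene": "GENE_A"},
        ],
    },
}

# Collect one row per cluster instead of overwriting a variable in the loop.
rows = [
    {"cluster": name, "CDhits": len(clus.get("cdhitclusters", []))}
    for name, clus in gene_clusters.items()
]

# Build the DataFrame once, after all clusters have been visited.
cdhit_counts = pd.DataFrame(rows, columns=["cluster", "CDhits"])
print(cdhit_counts)
```

The "cluster" column can then serve as the key for merging cdhit_counts into the larger data frame.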
Upvotes: 2