Reputation: 3271
I have some gene sequencing data like below:
data = [{'sequence': 'gene1__gene2__gene3', 'occurrence': 10},
{'sequence': 'gene2__gene3', 'occurrence': 5},
{'sequence': 'gene2', 'occurrence': 2},
{'sequence': 'gene4', 'occurrence': 4}
]
I want to transform this into the following tree-like dictionary, where any sub-path gives the co-occurrence count of that set of genes:
tree_dict = {
'gene1': {'occurrence': 10, 'self': 0, 'children': {'gene2': {'occurrence': 10, 'self': 0, 'children': {'gene3': {'occurrence': 10, 'self': 10, 'children': {}}}},
'gene3': {'occurrence': 10, 'self': 0, 'children': {'gene2': {'occurrence': 10, 'self': 10, 'children': {}}}},
}
},
'gene2': {'occurrence': 17, 'self': 2, 'children': {'gene1': {'occurrence': 10, 'self': 0, 'children': {'gene3': {'occurrence': 10, 'self': 10, 'children': {}}}},
'gene3': {'occurrence': 15, 'self': 5, 'children': {'gene1': {'occurrence': 10, 'self': 10, 'children': {}}}},
}
},
'gene3': {'occurrence': 15, 'self': 0, 'children': {'gene1': {'occurrence': 10, 'self': 0, 'children': {'gene2': {'occurrence': 10, 'self': 10, 'children': {}}}},
'gene2': {'occurrence': 15, 'self': 5, 'children': {'gene1': {'occurrence': 10, 'self': 10, 'children': {}}}},
}
},
'gene4': {'occurrence': 4, 'self': 4, 'children': {}}
}
In the tree_dict above, self is the number of times exactly the genes in the (sub)path occur, with no other genes alongside them. For example, gene3 never occurs by itself and thus has a self value of 0, while gene2 occurs by itself 2 times and thus has a self value of 2. occurrence is the number of times the genes in the (sub)path occur together, either as a whole sequence or as part of a longer one.
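To make the two definitions concrete, here is a minimal sketch that computes them directly from the flat data, without building the tree. The helper name counts is hypothetical, and it assumes that the order of genes within a sequence does not matter, which is how I read the expected output above.

```python
data = [{'sequence': 'gene1__gene2__gene3', 'occurrence': 10},
        {'sequence': 'gene2__gene3', 'occurrence': 5},
        {'sequence': 'gene2', 'occurrence': 2},
        {'sequence': 'gene4', 'occurrence': 4}]

def counts(data, genes):
    """Return (occurrence, self) for an unordered set of genes (hypothetical helper)."""
    wanted = frozenset(genes)
    occurrence = 0
    self_count = 0
    for item in data:
        present = frozenset(item['sequence'].split('__'))
        # 'occurrence': the wanted genes all appear, possibly with others.
        if wanted <= present:
            occurrence += item['occurrence']
        # 'self': the sequence consists of exactly the wanted genes.
        if wanted == present:
            self_count += item['occurrence']
    return occurrence, self_count

print(counts(data, ['gene2']))           # (17, 2)
print(counts(data, ['gene3', 'gene2']))  # (15, 5)
print(counts(data, ['gene3']))           # (15, 0)
```

These values match the corresponding nodes in tree_dict, for example the path gene3 -> gene2 has occurrence 15 and self 5.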
Code that I tried?
I tried several iterative approaches without success, although I believe the solution has to be a recursive function, similar to this question: How to transform a list into a hierarchy dict. I was not able to make any progress in that direction.
Upvotes: 0
Views: 248
Reputation: 787
Try this:
data = [{'sequence': 'gene1__gene2__gene3', 'occurrence': 10},
{'sequence': 'gene2__gene3', 'occurrence': 5},
{'sequence': 'gene2', 'occurrence': 2},
{'sequence': 'gene4', 'occurrence': 4}]
tree_dict = {}
def generate_tree(sequence, occurrence, curr_dict):
    # Split the remaining sequence into its individual genes.
    gene_list = sequence.split('__')
    for gene in gene_list:
        # Every gene in the sequence becomes a node at this level.
        if gene in curr_dict:
            curr_dict[gene]['occurrence'] += occurrence
        else:
            curr_dict[gene] = {'occurrence': occurrence, 'self': 0, 'children': {}}
        # Recurse on the genes that remain once this one is taken out.
        updated_list = gene_list.copy()
        updated_list.remove(gene)
        updated_sequence = '__'.join(updated_list)
        if updated_sequence != '':
            generate_tree(updated_sequence, occurrence, curr_dict[gene]['children'])
        else:
            # Nothing left: this path covers the whole sequence, so count it as 'self'.
            curr_dict[gene]['self'] += occurrence
for item in data:
    generate_tree(item['sequence'], item['occurrence'], tree_dict)
print(tree_dict)
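Once the tree is built, reading the count for a sub-path is a matter of walking the children dictionaries. Below is a minimal sketch of that; the helper name lookup is hypothetical, and the tree literal is only a hand-copied fragment of the output (the gene2 and gene4 branches) so that the snippet runs on its own.

```python
# Fragment of the tree printed above, copied by hand for illustration.
tree_dict = {
    'gene2': {'occurrence': 17, 'self': 2, 'children': {
        'gene3': {'occurrence': 15, 'self': 5, 'children': {
            'gene1': {'occurrence': 10, 'self': 10, 'children': {}}}},
    }},
    'gene4': {'occurrence': 4, 'self': 4, 'children': {}},
}

def lookup(tree, path):
    """Walk path (a list of gene names) and return the final node, or None."""
    node = None
    children = tree
    for gene in path:
        node = children.get(gene)
        if node is None:
            # The path does not exist in the tree.
            return None
        children = node['children']
    return node

node = lookup(tree_dict, ['gene2', 'gene3'])
print(node['occurrence'], node['self'])       # 15 5
print(lookup(tree_dict, ['gene4', 'gene1']))  # None
```

Keep in mind that generate_tree stores every ordering of the genes in each sequence, so the tree grows quickly as sequences get longer.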
Upvotes: 1