Reputation: 1511
I know there are a million questions like this, but I just can't find an answer that works for me.
I have this:
list1 = [{'assembly_id': '1', 'asym_id_list': ['A', 'B', 'E', 'G', 'H']}, {'assembly_id': '1', 'asym_id_list': ['C', 'D', 'F', 'I', 'J']}, {'assembly_id':2,'asym_id_list':['D,C'],'auth_id_list':['C','V']}]
If the assembly_ids are the same, I want to combine the values of the other matching keys in the dicts.
In this example, assembly_id 1 appears twice, so the input above would turn into:
[{'assembly_id': '1', 'asym_id_list': ['A', 'B', 'E', 'G', 'H', 'C', 'D', 'F', 'I', 'J']}, {'assembly_id': 2, 'asym_id_list': ['D,C'], 'auth_id_list': ['C', 'V']}]
In theory there can be n assembly_ids (i.e. assembly 1 could appear in the dict 10 or 20 times, not just 2) and there can be up to two other lists to combine (asym_id_list and auth_id_list).
I was looking at this method:
new_dict = {}
assembly_list = []  # to keep track of assemblies already seen
for dict_name in list1:  # for each dict in the list
    if dict_name['assembly_id'] not in assembly_list:  # if the assembly id is new
        new_dict['assembly_id'] = dict_name  # this line is wrong, add the entry to new_dict
        assembly_list.append(new_dict['assembly_id'])  # append the id to 'assembly_list'
    else:
        new_dict['assembly_id'].append(dict_name)  # else if it's already seen, append the dictionaries together, this is wrong
print(new_dict)
The output is wrong:
{'assembly_id': {'assembly_id': 2, 'asym_id_list': ['D,C'], 'auth_id_list': ['C', 'V']}}
But I think the idea is right, that I should open a new list and dict, and if not seen before, append; whereas if it has been seen before...combine? But it's just the specifics I'm not getting?
Upvotes: 1
Views: 161
Reputation: 2780
@Samwise has provided a good answer to the question you asked and this is not intended to replace that. However, I am going to make a suggestion to the way you are keeping the data after the merge. I would put this in a comment but there is no way to keep code formatting in a comment and it is a bit too big as well.
Before that, I think that you have a typo in your example data. I think that you meant the 'D,C' in 'assembly_id':2,'asym_id_list':['D,C'] to be separate strings like this: 'assembly_id':2,'asym_id_list':['D', 'C']. I am going to assume that below, but if not it does not change any of the code or comments.
Instead of the merged structure being a list of dictionaries like this:
merge_l = [
{'assembly_id': '1', 'asym_id_list': ['A', 'B', 'E', 'G', 'H', 'C', 'D', 'F', 'I', 'J']},
{'assembly_id': 2, 'asym_id_list': ['D', 'C'], 'auth_id_list': ['C', 'V']}
]
I would recommend not using a list as the top-level structure, but a dictionary keyed by the value of the assembly_id. So it would be a dictionary whose values are dictionaries. Like this:
merge_d = { '1': {'asym_id_list': ['A', 'B', 'E', 'G', 'H', 'C', 'D', 'F', 'I', 'J']},
'2': {'asym_id_list': ['D', 'C'], 'auth_id_list': ['C', 'V']}
}
or if you want to keep the 'assembly_id' as well, like this:
merge_d = { '1': {'assembly_id': '1', 'asym_id_list': ['A', 'B', 'E', 'G', 'H', 'C', 'D', 'F', 'I', 'J']},
'2': {'assembly_id': 2, 'asym_id_list': ['D', 'C'], 'auth_id_list': ['C', 'V']}
}
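That last one can be achieved by just changing the return from @Samwise's merge() method to return m instead of converting the dict to a list. The keys are then the assembly_id values exactly as they appear in the input, so with the example data the second key would be the integer 2 rather than the string '2'.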
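To illustrate why the keyed structure can be convenient, here is a minimal sketch that reuses the merge_d shape shown above. The variable names asym_ids, auth_ids and merge_l are only for illustration, and the data is copied from the example rather than produced by any particular merge function.

```python
# Merged data keyed by assembly_id, copied from the example above.
merge_d = {
    '1': {'assembly_id': '1', 'asym_id_list': ['A', 'B', 'E', 'G', 'H', 'C', 'D', 'F', 'I', 'J']},
    '2': {'assembly_id': 2, 'asym_id_list': ['D', 'C'], 'auth_id_list': ['C', 'V']},
}

# Direct lookup by id, with no scan through a list of dicts.
asym_ids = merge_d['1']['asym_id_list']

# .get() supplies a default when an entry has no auth_id_list at all.
auth_ids = merge_d['1'].get('auth_id_list', [])

# If other code still expects the list-of-dicts shape, it is one call away.
merge_l = list(merge_d.values())
```

With a plain list you would have to loop over every entry to find a given assembly_id; with the dictionary the lookup is a single indexing operation.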
One other comment on @Samwise's code, just so you are aware of it, is that the combined lists can contain duplicates. So if the original data had 'asym_id_list': ['A', 'B'] in one entry and 'asym_id_list': ['B', 'C'] in another, the combined list would contain 'asym_id_list': ['A', 'B', 'B', 'C']. That could be what you want, but if you want to avoid that you could use sets instead of lists as the internal containers for the asym_id and auth_id values. Note that sets do not preserve the original order of the items.
In @Samwise's answer, change it to something like this:
def merge(dicts):
    m = {}  # keeps track of the visited assembly_ids
    for d in dicts:
        key = d['assembly_id']  # assembly_id is used as merge/grouping key
        if key in m:
            if 'asym_id_list' in d:
                m[key]['asym_id_list'] = m[key].get('asym_id_list', set()) | set(d['asym_id_list'])
            if 'auth_id_list' in d:
                m[key]['auth_id_list'] = m[key].get('auth_id_list', set()) | set(d['auth_id_list'])
        else:
            m[key] = {'assembly_id': d['assembly_id']}
            if 'asym_id_list' in d:
                m[key]['asym_id_list'] = set(d['asym_id_list'])
            if 'auth_id_list' in d:
                m[key]['auth_id_list'] = set(d['auth_id_list'])
    return m
If you go this way, you might want to reconsider the key names 'asym_id_list' and 'auth_id_list' since they are sets, not lists. But that may be constrained by the other code around this and what it expects.
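If the order of the ids matters, or the surrounding code really does expect lists, one alternative is to keep lists and drop duplicates with dict.fromkeys, which keeps the first occurrence of each item. The following is only a sketch: the function name merge_unique and its list_keys parameter are made up for illustration, and it relies on dicts preserving insertion order (guaranteed from Python 3.7).

```python
def merge_unique(dicts, list_keys=('asym_id_list', 'auth_id_list')):
    """Merge dicts sharing an assembly_id, dropping duplicates but keeping order."""
    merged = {}
    for d in dicts:
        # Create the entry on first sight of this assembly_id, reuse it afterwards.
        entry = merged.setdefault(d['assembly_id'], {'assembly_id': d['assembly_id']})
        for key in list_keys:
            if key in d:
                # Concatenation builds a new list, so the input is left untouched.
                combined = entry.get(key, []) + d[key]
                # dict.fromkeys keeps only the first occurrence of each item, in order.
                entry[key] = list(dict.fromkeys(combined))
    return merged


data = [
    {'assembly_id': '1', 'asym_id_list': ['A', 'B']},
    {'assembly_id': '1', 'asym_id_list': ['B', 'C']},
    {'assembly_id': 2, 'asym_id_list': ['D', 'C'], 'auth_id_list': ['C', 'V']},
]
result = merge_unique(data)
print(result['1']['asym_id_list'])  # ['A', 'B', 'C']
```

This keeps the key names accurate (the values are still lists) at the cost of rebuilding each list on every merge, which should be fine for small inputs like these.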
Upvotes: 0
Reputation: 71689
Your reasoning is on the right track. We can use a dictionary m that maps each assembly_id to its corresponding dictionary to keep track of visited assembly_ids. Whenever a new assembly_id is encountered we add it to m; otherwise, if m already contains the assembly_id, we just extend the asym_id_list and auth_id_list for that assembly_id:
def merge(dicts):
    m = {}  # keeps track of the visited assembly_ids
    for d in dicts:
        key = d['assembly_id']  # assembly_id is used as merge/grouping key
        if key in m:
            # two independent checks, so an entry carrying both lists merges both
            if 'asym_id_list' in d:
                m[key]['asym_id_list'] = m[key].get('asym_id_list', []) + d['asym_id_list']
            if 'auth_id_list' in d:
                m[key]['auth_id_list'] = m[key].get('auth_id_list', []) + d['auth_id_list']
        else:
            m[key] = dict(d)  # shallow copy, so the input dicts are not modified
    return list(m.values())
Result:
# merge(list1)
[
{
'assembly_id': '1', 'asym_id_list': ['A', 'B', 'E', 'G', 'H', 'C', 'D', 'F', 'I', 'J']
},
{
'assembly_id': 2, 'asym_id_list': ['D,C'], 'auth_id_list': ['C', 'V']
}
]
Upvotes: 1
Reputation: 71454
Use a dict keyed on assembly_id to collect all the data for a given key; you can then go back and generate a list of dicts in the original format if needed.
>>> from collections import defaultdict
>>> from typing import Dict, List
>>> id_lists: Dict[str, List[str]] = defaultdict(list)
>>> for d in list1:
...     id_lists[d['assembly_id']].extend(d['asym_id_list'])
...
>>> combined_list = [{
...     'assembly_id': id, 'asym_id_list': id_list
... } for id, id_list in id_lists.items()]
>>> combined_list
[{'assembly_id': '1', 'asym_id_list': ['A', 'B', 'E', 'G', 'H', 'C', 'D', 'F', 'I', 'J']}, {'assembly_id': 2, 'asym_id_list': ['D,C']}]
>>>
(edit) didn't see the bit about auth_id_lists because it's hidden in the scroll in the original code -- same strategy applies, just either use two dicts in the first step or make it a dict of some collection of lists (e.g. a dict of dicts of lists, with the outer dict keyed on assembly_id values and the inner dict keyed on the original field name).
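The dict-of-dicts-of-lists idea could look roughly like the sketch below. It assumes that every field other than assembly_id holds a list; the nested defaultdict and the loop variable names are just one way to write it.

```python
from collections import defaultdict

list1 = [
    {'assembly_id': '1', 'asym_id_list': ['A', 'B', 'E', 'G', 'H']},
    {'assembly_id': '1', 'asym_id_list': ['C', 'D', 'F', 'I', 'J']},
    {'assembly_id': 2, 'asym_id_list': ['D,C'], 'auth_id_list': ['C', 'V']},
]

# Outer key: assembly_id. Inner key: original field name. Value: combined list.
id_lists = defaultdict(lambda: defaultdict(list))
for d in list1:
    for field, values in d.items():
        if field != 'assembly_id':  # assumption: all other fields are lists
            id_lists[d['assembly_id']][field].extend(values)

# Rebuild the original list-of-dicts format.
combined_list = [
    {'assembly_id': assembly_id, **fields}
    for assembly_id, fields in id_lists.items()
]
```

Because the inner dict is keyed on the field name, this handles asym_id_list, auth_id_list, or any other list-valued field without naming them explicitly.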
Upvotes: 1