Slowat_Kela

Reputation: 1511

Python: Combine all dict key values based on one particular key being the same

I know there are a million questions like this, but I can't find an answer that works for me.

I have this:

list1 =   [{'assembly_id': '1', 'asym_id_list': ['A', 'B', 'E', 'G', 'H']}, {'assembly_id': '1', 'asym_id_list': ['C', 'D', 'F', 'I', 'J']}, {'assembly_id':2,'asym_id_list':['D,C'],'auth_id_list':['C','V']}]

If the assembly_ids are the same, I want to combine the values of the other matching keys in the dicts.

In this example, assembly_id 1 appears twice, so the input above would turn into:

[{'assembly_id': '1', 'asym_id_list': ['A', 'B', 'E', 'G', 'H','C', 'D', 'F', 'I', 'J']},{'assembly_id':2,'asym_id_list':['D,C'],'auth_id_list':['C','V']}]

In theory there can be n assembly_ids (i.e. assembly 1 could appear in the dict 10 or 20 times, not just 2) and there can be up to two other lists to combine (asym_id_list and auth_id_list).

I was looking at this method:

new_dict = {}
assembly_list = [] #to keep track of assemblies already seen
for dict_name in list1: #for each dict in the list
        if dict_name['assembly_id'] not in assembly_list: #if the assembly id is new
                new_dict['assembly_id'] = dict_name #this line is wrong, add the entry to new_dict
                assembly_list.append(new_dict['assembly_id']) #append the id to 'assembly_list'
        else:
                new_dict['assembly_id'].append(dict_name) #else if it's already seen, append the dictionaries together, this is wrong
print(new_dict)

The output is wrong:

{'assembly_id': {'assembly_id': 2, 'asym_id_list': ['D,C'], 'auth_id_list': ['C', 'V']}}

But I think the idea is right: keep track of the ids already seen, add a dict when its id is new, and combine it with the existing entry when the id has been seen before. It's just the specifics of the combining step that I'm not getting.

Upvotes: 1

Views: 161

Answers (3)

Glenn Mackintosh

Reputation: 2780

@Samwise has provided a good answer to the question you asked, and this is not intended to replace that. However, I am going to make a suggestion about the way you keep the data after the merge. I would put this in a comment, but comments cannot keep code formatting and it is a bit too long as well.

Before that, I think that you have a typo in your example data. I think that you meant the 'D,C' in 'assembly_id':2,'asym_id_list':['D,C'] to be separate strings like this: 'assembly_id':2,'asym_id_list':['D', 'C']. I am going to assume that below, but if not it does not change any of the code or comments.

The merged structure is currently a list of dictionaries like this:

merge_l = [
            {'assembly_id': '1', 'asym_id_list': ['A', 'B', 'E', 'G', 'H', 'C', 'D', 'F', 'I', 'J']},
            {'assembly_id': 2, 'asym_id_list': ['D', 'C'], 'auth_id_list': ['C', 'V']}
          ]

I would recommend not using a list as the top-level structure, but instead using a dictionary keyed by the value of the assembly_id, so that it is a dictionary whose values are dictionaries. Like this:

merge_d = { '1': {'asym_id_list': ['A', 'B', 'E', 'G', 'H', 'C', 'D', 'F', 'I', 'J']},
            '2': {'asym_id_list': ['D', 'C'], 'auth_id_list': ['C', 'V']}
          }

or if you want to keep the 'assembly_id' as well, like this:

merge_d = { '1': {'assembly_id': '1', 'asym_id_list': ['A', 'B', 'E', 'G', 'H', 'C', 'D', 'F', 'I', 'J']},
            '2': {'assembly_id': 2, 'asym_id_list': ['D', 'C'], 'auth_id_list': ['C', 'V']}
          }

That last one can be achieved by changing the return in @Samwise's merge() method to return m instead of converting the dict to a list.
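
As a rough sketch of why the keyed form is convenient: a lookup by assembly_id becomes a single dictionary access, whereas the list form needs a scan. The literal below is only the example data from above, and the names from_list and from_dict are made up for illustration. Note that the keys keep whatever type the original assembly_id had, so with this data they are '1' (a string) and 2 (an int).

```python
# Example data only: the merged result in the list form discussed above.
merge_l = [
    {'assembly_id': '1', 'asym_id_list': ['A', 'B', 'E', 'G', 'H', 'C', 'D', 'F', 'I', 'J']},
    {'assembly_id': 2, 'asym_id_list': ['D', 'C'], 'auth_id_list': ['C', 'V']},
]

# Dict form: the same inner dicts, keyed by their assembly_id value.
merge_d = {d['assembly_id']: d for d in merge_l}

# List form: scan until the matching assembly_id is found.
from_list = next(d for d in merge_l if d['assembly_id'] == '1')

# Dict form: direct lookup by key.
from_dict = merge_d['1']

print(from_list is from_dict)  # True
print(list(merge_d))           # ['1', 2]
```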

One other comment on @Samwise's code, just so you are aware of it, is that the combined lists can contain duplicates. So if the original data had 'asym_id_list': ['A', 'B'] in one entry and 'asym_id_list': ['B', 'C'] in another, the combined list would contain 'asym_id_list': ['A', 'B', 'B', 'C']. That could be what you want, but if you want to avoid it you could use sets instead of lists as the internal containers for asym_id_list and auth_id_list.

In @Samwise's answer, change it to something like this:

def merge(dicts):
    m = {} # keeps track of the visited assembly_ids
    for d in dicts:
        key = d['assembly_id'] # assembly_id is used as merge/grouping key
        if key in m:
            if 'asym_id_list' in d:
                m[key]['asym_id_list'] = m[key].get('asym_id_list', set()) | set(d['asym_id_list'])
            if 'auth_id_list' in d:
                m[key]['auth_id_list'] = m[key].get('auth_id_list', set()) | set(d['auth_id_list'])
        else:
            m[key] = {'assembly_id': d['assembly_id']}
            if 'asym_id_list' in d:
                m[key]['asym_id_list'] = set(d['asym_id_list'])
            if 'auth_id_list' in d:
                m[key]['auth_id_list'] = set(d['auth_id_list'])
    return m

If you go this way, you might want to reconsider the key names 'asym_id_list' and 'auth_id_list', since the values are sets, not lists. But that may be constrained by the other code around this and what it expects.
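
One thing to be aware of with sets is that they do not keep the original order of the ids. If the order matters, a possible alternative is to keep lists and drop the duplicates with dict.fromkeys, which preserves first-seen order on Python 3.7+. This is only a sketch, and dedupe_keep_order is a name I made up for the illustration:

```python
def dedupe_keep_order(items):
    """Return a new list with duplicates removed, keeping first-seen order."""
    # dict keys are unique and preserve insertion order (Python 3.7+)
    return list(dict.fromkeys(items))

combined = ['A', 'B'] + ['B', 'C']
print(dedupe_keep_order(combined))  # ['A', 'B', 'C']
```

This only works when the items are hashable, which is the case for the string ids in the example data.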

Upvotes: 0

Shubham Sharma

Reputation: 71689

Your logic is on the right track. We can use a dictionary m that maps each assembly_id to its corresponding dictionary, which also keeps track of the assembly_ids already visited. Whenever a new assembly_id is encountered, we add it to m; otherwise, if m already contains that assembly_id, we just extend the asym_id_list and auth_id_list for it:

def merge(dicts):
    m = {} # keeps track of the visited assembly_ids
    for d in dicts:
        key = d['assembly_id'] # assembly_id is used as merge/grouping key
        if key in m:
            if 'asym_id_list' in d:
                m[key]['asym_id_list'] = m[key].get('asym_id_list', []) + d['asym_id_list']
            if 'auth_id_list' in d:  # separate check so both lists are merged when both are present
                m[key]['auth_id_list'] = m[key].get('auth_id_list', []) + d['auth_id_list']
        else:
            m[key] = dict(d)  # shallow copy so the input dicts are not modified
    return list(m.values())

Result:

# merge(list1)
[
    {
        'assembly_id': '1', 'asym_id_list': ['A', 'B', 'E', 'G', 'H', 'C', 'D', 'F', 'I', 'J']
    },
    {
        'assembly_id': 2, 'asym_id_list': ['D,C'], 'auth_id_list': ['C', 'V']
    }
]
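
If you would rather not hard-code the two list names, one possible generalisation is sketched below. The name merge_any is hypothetical, and the sketch assumes that every key other than assembly_id holds a list:

```python
def merge_any(dicts, group_key='assembly_id'):
    """Group dicts by group_key and concatenate every other list-valued key."""
    m = {}
    for d in dicts:
        # start a fresh entry the first time a group_key value is seen
        entry = m.setdefault(d[group_key], {group_key: d[group_key]})
        for name, values in d.items():
            if name != group_key:
                # build a new list so the input dicts are never mutated
                entry[name] = entry.get(name, []) + list(values)
    return list(m.values())

list1 = [
    {'assembly_id': '1', 'asym_id_list': ['A', 'B', 'E', 'G', 'H']},
    {'assembly_id': '1', 'asym_id_list': ['C', 'D', 'F', 'I', 'J']},
    {'assembly_id': 2, 'asym_id_list': ['D,C'], 'auth_id_list': ['C', 'V']},
]
print(merge_any(list1))
```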

Upvotes: 1

Samwise

Reputation: 71454

Use a dict keyed on assembly_id to collect all the data for a given key; you can then go back and generate a list of dicts in the original format if needed.

>>> from collections import defaultdict
>>> from typing import Dict, List
>>> id_lists: Dict[str, List[str]] = defaultdict(list)
>>> for d in list1:
...     id_lists[d['assembly_id']].extend(d['asym_id_list'])
...
>>> combined_list = [{
...     'assembly_id': id, 'asym_id_list': id_list
... } for id, id_list in id_lists.items()]
>>> combined_list
[{'assembly_id': '1', 'asym_id_list': ['A', 'B', 'E', 'G', 'H', 'C', 'D', 'F', 'I', 'J']}, {'assembly_id': 2, 'asym_id_list': ['D,C']}]
>>>

(edit) I didn't see the bit about auth_id_list because it's hidden in the scroll in the original code. The same strategy applies: either use two dicts in the first step, or make it a dict of some collection of lists (e.g. a dict of dicts of lists, with the outer dict keyed on assembly_id values and the inner dict keyed on the original field name).
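
A minimal sketch of the dict-of-dicts-of-lists variant, assuming every field other than assembly_id holds a list (the loop variable names are just for illustration):

```python
from collections import defaultdict

list1 = [
    {'assembly_id': '1', 'asym_id_list': ['A', 'B', 'E', 'G', 'H']},
    {'assembly_id': '1', 'asym_id_list': ['C', 'D', 'F', 'I', 'J']},
    {'assembly_id': 2, 'asym_id_list': ['D,C'], 'auth_id_list': ['C', 'V']},
]

# outer dict keyed on assembly_id value, inner dict keyed on the original field name
id_lists = defaultdict(lambda: defaultdict(list))
for d in list1:
    for field, values in d.items():
        if field != 'assembly_id':
            id_lists[d['assembly_id']][field].extend(values)

# go back to a list of dicts in the original format
combined_list = [
    {'assembly_id': assembly_id, **fields}
    for assembly_id, fields in id_lists.items()
]
print(combined_list)
```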

Upvotes: 1
