Reputation: 35
I have the following DataFrame:
df = pd.DataFrame(
    {'Column1': ['A', 'B', 'C', 'D'],
     'Column2': [['tango', 'bravo', 'alpha'], ['bravo', 'test'], ['romero', 'test'], ['delta']]}
)
print(df)
  Column1                Column2
0       A  [tango, bravo, alpha]
1       B          [bravo, test]
2       C         [romero, test]
3       D                [delta]
I converted the df to a dict like this:
d = {'A': ['tango', 'bravo', 'alpha'],
     'B': ['bravo', 'test'],
     'C': ['romero', 'test'],
     'D': ['delta']}
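For reference, one way to build this dict from the DataFrame above is to zip the two columns (a sketch, assuming `df` as defined earlier):

```python
import pandas as pd

df = pd.DataFrame(
    {'Column1': ['A', 'B', 'C', 'D'],
     'Column2': [['tango', 'bravo', 'alpha'], ['bravo', 'test'], ['romero', 'test'], ['delta']]}
)

# map each Column1 entry to its Column2 list
d = dict(zip(df['Column1'], df['Column2']))
```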
What I want is to merge all rows (keys and values) that share a common value, which in my case would result in the following dictionary:
d = {"A , B , C": {"tango", "bravo", "alpha", "test", "romero"},
     "D": {"delta"}}
This task could be done on the DataFrame with pandas or on the dictionary directly; I don't know which one is easier.
Upvotes: 1
Views: 85
Reputation: 7913
I found a solution with networkx.Graph and networkx.connected_components.
Using the same input as @mmdanziger since it covers more possible outcomes for your rules.
import networkx as nx
import itertools
d = {
    "A": ["tango", "bravo", "alpha"],
    "B": ["bravo", "test"],
    "C": ["romero", "test"],
    "D": ["other"],
    "E": ["and", "other"],
    "F": ["loner"],
}
# build all pairwise combinations of keys
# and check for common values in their lists
G = nx.Graph()
for nodes in itertools.combinations(d.keys(), r=2):
    common_edges = set(d[nodes[0]]) & set(d[nodes[1]])
    for edge in common_edges:
        G.add_edge(*nodes, value=edge)
# get list with all connected keys (keys which have any common value)
connected = list(nx.connected_components(G))
print(connected)
# [{'A', 'B', 'C'}, {'D', 'E'}]
# create new dict from "connected" with joined keys and merged values
new_dict = {}
for group in connected:
    res = set()
    for key in group:
        res.update(d[key])
    new_dict[' , '.join(group)] = res
# add keys from the original dictionary which aren't connected to anything
# (keys that share a value with another key are exactly the nodes of G)
for k, v in d.items():
    if k not in G:
        new_dict[k] = v
print(new_dict)
Output:
{'A , B , C': {'alpha', 'bravo', 'romero', 'tango', 'test'},
'D , E': {'and', 'other'},
'F': ['loner']}
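As a variation (a sketch, not part of the answer above): if every key is added to the graph as a node up front with add_nodes_from, isolated keys like "F" come out of connected_components as single-node components too, so the final clean-up loop disappears and every value ends up as a set:

```python
import itertools
import networkx as nx

d = {
    "A": ["tango", "bravo", "alpha"],
    "B": ["bravo", "test"],
    "C": ["romero", "test"],
    "D": ["other"],
    "E": ["and", "other"],
    "F": ["loner"],
}

G = nx.Graph()
G.add_nodes_from(d)  # isolated keys become single-node components
for a, b in itertools.combinations(d, r=2):
    if set(d[a]) & set(d[b]):  # any shared value connects the two keys
        G.add_edge(a, b)

new_dict = {
    ' , '.join(sorted(group)): set().union(*(d[k] for k in group))
    for group in nx.connected_components(G)
}
```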
Upvotes: 1
Reputation: 4658
This question is deceptively complex, because you may have to merge an arbitrary number of dict items based on a pairwise comparison. I was not able to solve it with a punchy one-liner with itertools and had to actually articulate the algorithm old-school:
# let's get a slightly more interesting input, in a more amenable datatype
D = {
    "A": ["tango", "bravo", "alpha"],
    "B": ["bravo", "test"],
    "C": ["romero", "test"],
    "D": ["other"],
    "E": ["and", "other"],
    "F": ["loner"],
}
queue, outqueue = list(D.items()), []
while queue:
    k, v = queue.pop(0)
    shares_values = False
    for idx, (otherk, otherv) in enumerate(queue):
        if set(v) & set(otherv):
            queue.pop(idx)  # don't keep re-adding the match
            shares_values = True
            newk = f"{k}, {otherk}"
            newv = v + otherv
            queue.insert(0, (newk, newv))  # at 0 so order is respected
            break
    if not shares_values:
        outqueue.append((k, v))
outdict = dict(outqueue)
assert outdict == {
    "A, B, C": ["tango", "bravo", "alpha", "bravo", "test", "romero", "test"],
    "D, E": ["other", "and", "other"],
    "F": ["loner"],
}
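Note that this keeps duplicates in the merged lists. If you want the set-valued output shown in the question, a dict comprehension over the result converts it (using the outdict from the assert above):

```python
outdict = {
    "A, B, C": ["tango", "bravo", "alpha", "bravo", "test", "romero", "test"],
    "D, E": ["other", "and", "other"],
    "F": ["loner"],
}

# deduplicate each merged value list by converting it to a set
merged = {k: set(v) for k, v in outdict.items()}
```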
There may be some kind of itertools or DataFrame.groupby.agg magic that could solve this problem, but given that the solution will always require an arbitrary number of passes over the item list, you may be better off using an explicit queue-processing approach like the one I've shown here.
Upvotes: 2