Gab

Reputation: 35

How to merge all dictionary items with common values in Python

I have the following DataFrame:

df = pd.DataFrame(
    {'Column1': ['A', 'B', 'C', 'D'],
     'Column2': [['tango', 'bravo', 'alpha'], ['bravo', 'test'], ['romero', 'test'], ['delta']]
    })
print(df)

  Column1                Column2
0       A  [tango, bravo, alpha]
1       B          [bravo, test]
2       C         [romero, test]
3       D                [delta]

I converted the df to a dict like this:

d = {'A': ['tango', 'bravo', 'alpha'],
     'B': ['bravo', 'test'],
     'C': ['romero', 'test'],
     'D': ['delta']}
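(For reference, one way to get this dict from the DataFrame above, assuming the column names shown:)

```python
import pandas as pd

df = pd.DataFrame(
    {'Column1': ['A', 'B', 'C', 'D'],
     'Column2': [['tango', 'bravo', 'alpha'], ['bravo', 'test'],
                 ['romero', 'test'], ['delta']]})

# pair each Column1 entry with its Column2 list
d = dict(zip(df['Column1'], df['Column2']))
print(d)
```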

What I want is to merge all rows (keys and values) that share a common value, which in my case would result in the following dictionary:

d = {"A , B , C": {"tango", "bravo", "alpha", "test", "romero"},
     "D": {"delta"}}

This task can be done either on the DataFrame with pandas or on the dictionary; I don't know which one is easier.

Upvotes: 1

Views: 85

Answers (2)

Rabinzel

Reputation: 7913

I found a solution using networkx.Graph and networkx.connected_components.

I'm using the same input as @mmdanziger, since it covers more of the possible cases for your rules.

import networkx as nx
import itertools

d = {
    "A": ["tango", "bravo", "alpha"],
    "B": ["bravo", "test"],
    "C": ["romero", "test"],
    "D": ["other"],
    "E": ["and", "other"],
    "F": ["loner"],
}

# build all pairwise combinations of keys
# and check for common values in their lists
G = nx.Graph()
for nodes in itertools.combinations(d.keys(), r=2):
    common_edges = set(d[nodes[0]]) & set(d[nodes[1]])
    for edge in common_edges:
        G.add_edge(*nodes, value=edge)

# get list with all connected keys (keys which have any common value)
connected = list(nx.connected_components(G))
print(connected)
# [{'A', 'B', 'C'}, {'D', 'E'}]

# create new dict from "connected" with joined keys and joined values
new_dict = {}
for groups in connected:
    res = set()
    for key in groups:
        res.update(d[key])
    new_dict[' , '.join(sorted(groups))] = res  # sorted: sets have no stable order

# add items from the original dictionary that aren't connected to anything
# (the substring check works here because the original keys are single letters)
for k, v in d.items():
    if not any(k in key for key in new_dict):
        new_dict[k] = set(v)

print(new_dict)

Output:

{'A , B , C': {'alpha', 'bravo', 'romero', 'tango', 'test'},
 'D , E': {'and', 'other'},
 'F': {'loner'}}
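Since the question also allows a pandas result, the merged dict can be turned back into a DataFrame of the original shape; a small sketch, assuming the merged values are sets as above:

```python
import pandas as pd

# merged result from the steps above
new_dict = {'A , B , C': {'alpha', 'bravo', 'romero', 'tango', 'test'},
            'D , E': {'and', 'other'},
            'F': {'loner'}}

# one row per merged group; values sorted only for a stable display
df_merged = pd.DataFrame(
    {'Column1': list(new_dict),
     'Column2': [sorted(v) for v in new_dict.values()]})
print(df_merged)
```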

Upvotes: 1

mmdanziger

Reputation: 4658

This question is deceptively complex because you may have to merge an arbitrary number of dict items based on pairwise comparisons. I was not able to solve it with a punchy itertools one-liner and had to actually articulate the algorithm old-school:

# let's get a slightly more interesting input, in a more amenable datatype
D = {
    "A": ["tango", "bravo", "alpha"],
    "B": ["bravo", "test"],
    "C": ["romero", "test"],
    "D": ["other"],
    "E": ["and", "other"],
    "F": ["loner"],
}

queue, outqueue = list(D.items()), []
while queue:
    k, v = queue.pop(0)
    shares_values = False
    for idx, [otherk, otherv] in enumerate(queue):
        if not shares_values and set(v) & set(otherv):
            queue.pop(idx)  # don't keep re-adding the match
            shares_values = True
            newk = f"{k}, {otherk}"
            newv = v + otherv
            queue.insert(0, (newk, newv))  # at 0 so order is respected
            break
    if not shares_values:
        outqueue.append((k, v))
outdict = dict(outqueue)
assert outdict == {
    "A, B, C": ["tango", "bravo", "alpha", "bravo", "test", "romero", "test"],
    "D, E": ["other", "and", "other"],
    "F": ["loner"],
}

There may be some kind of itertools or DataFrame.groupby.agg magic that could solve this problem, but given that the solution will always require an arbitrary number of passes over the item list, you may be better off using an explicit queue-processing approach like the one I've shown here.
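The repeated pairwise merging also maps naturally onto a union-find (disjoint-set) structure; a minimal sketch, not taken from either answer, that unions two keys whenever their value lists intersect:

```python
from itertools import combinations

d = {
    "A": ["tango", "bravo", "alpha"],
    "B": ["bravo", "test"],
    "C": ["romero", "test"],
    "D": ["other"],
    "E": ["and", "other"],
    "F": ["loner"],
}

# each key starts as its own root
parent = {k: k for k in d}

def find(x):
    # walk to the root, compressing the path as we go
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

# union any two keys whose value lists share an element
for a, b in combinations(d, 2):
    if set(d[a]) & set(d[b]):
        parent[find(a)] = find(b)

# group keys by their root
groups = {}
for k in d:
    groups.setdefault(find(k), []).append(k)

# join the grouped keys and take the union of their values
merged = {", ".join(sorted(g)): sorted({v for k in g for v in d[k]})
          for g in groups.values()}
print(merged)
# {'A, B, C': ['alpha', 'bravo', 'romero', 'tango', 'test'],
#  'D, E': ['and', 'other'], 'F': ['loner']}
```

This trades the repeated queue passes for near-linear union-find operations after the one-time pairwise intersection check.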

Upvotes: 2
