Reputation: 35
I have the following DataFrame:
df = pd.DataFrame(
    {'Column1': ['A', 'B', 'C', 'D'],
     'Column2': [['tango', 'bravo', 'alpha'], ['bravo', 'test'], ['romero', 'test'], ['delta']]}
)
print(df)
  Column1                Column2
0       A  [tango, bravo, alpha]
1       B          [bravo, test]
2       C         [romero, test]
3       D                [delta]
I converted the df to a dict like this:
d = {'A': ['tango', 'bravo', 'alpha'],
     'B': ['bravo', 'test'],
     'C': ['romero', 'test'],
     'D': ['delta']}
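For reference, one way to build this dict from the DataFrame above is to zip the two columns (a sketch, assuming `df` as defined earlier):

```python
import pandas as pd

df = pd.DataFrame(
    {'Column1': ['A', 'B', 'C', 'D'],
     'Column2': [['tango', 'bravo', 'alpha'], ['bravo', 'test'], ['romero', 'test'], ['delta']]}
)

# map each Column1 entry to its Column2 list
d = dict(zip(df['Column1'], df['Column2']))
```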
What I want is to merge all rows (keys and values) that share a common value, which in my case would result in the following dictionary:
d = {"A , B , C": {"tango", "bravo", "alpha", "test", "romero"},
     "D": {"delta"}}
This task could be done on the DataFrame with pandas or on the dictionary directly; I don't know which one is easier.
Upvotes: 1
Views: 85
Reputation: 7913
I found a solution with networkx.Graph and networkx.connected_components.
Using the same input as @mmdanziger since it covers more possible outcomes for your rules.
import networkx as nx
import itertools
d = {
    "A": ["tango", "bravo", "alpha"],
    "B": ["bravo", "test"],
    "C": ["romero", "test"],
    "D": ["other"],
    "E": ["and", "other"],
    "F": ["loner"],
}
# build all pairwise combinations of keys
# and check for common values in their lists
G = nx.Graph()
for nodes in itertools.combinations(d.keys(), r=2):
    common_edges = set(d[nodes[0]]) & set(d[nodes[1]])
    for edge in common_edges:
        G.add_edge(*nodes, value=edge)
# get list with all connected keys (keys which have any common value)
connected = list(nx.connected_components(G))
print(connected)
# [{'A', 'B', 'C'}, {'D', 'E'}]
# create new dict from "connected" with joined keys and merged values
new_dict = {}
for group in connected:
    res = set()
    for key in group:
        res.update(d[key])
    new_dict[' , '.join(group)] = res
# add keys from the original dictionary which aren't connected to anything
# (keys that share a value with another key are exactly the nodes of G)
for k, v in d.items():
    if k not in G:
        new_dict[k] = v
print(new_dict)
Output:
{'A , B , C': {'alpha', 'bravo', 'romero', 'tango', 'test'},
'D , E': {'and', 'other'},
'F': ['loner']}
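As a variation (a sketch, not part of the answer above): if every key is added to the graph as a node up front with add_nodes_from, isolated keys like "F" come out of connected_components as single-node components too, so the final clean-up loop disappears and every value ends up as a set:

```python
import itertools
import networkx as nx

d = {
    "A": ["tango", "bravo", "alpha"],
    "B": ["bravo", "test"],
    "C": ["romero", "test"],
    "D": ["other"],
    "E": ["and", "other"],
    "F": ["loner"],
}

G = nx.Graph()
G.add_nodes_from(d)  # isolated keys become single-node components
for a, b in itertools.combinations(d, r=2):
    if set(d[a]) & set(d[b]):  # any shared value connects the two keys
        G.add_edge(a, b)

new_dict = {
    ' , '.join(sorted(group)): set().union(*(d[k] for k in group))
    for group in nx.connected_components(G)
}
```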
Upvotes: 1
Reputation: 4658
This question is deceptively complex, because you may have to merge an arbitrary number of dict items based on a pairwise comparison. I was not able to solve it with a punchy one-liner with itertools and had to actually articulate the algorithm old-school:
# let's get a slightly more interesting input, in a more amenable datatype
D = {
    "A": ["tango", "bravo", "alpha"],
    "B": ["bravo", "test"],
    "C": ["romero", "test"],
    "D": ["other"],
    "E": ["and", "other"],
    "F": ["loner"],
}
queue, outqueue = list(D.items()), []
while queue:
    k, v = queue.pop(0)
    shares_values = False
    for idx, (otherk, otherv) in enumerate(queue):
        if set(v) & set(otherv):
            queue.pop(idx)  # don't keep re-adding the match
            shares_values = True
            newk = f"{k}, {otherk}"
            newv = v + otherv
            queue.insert(0, (newk, newv))  # at 0 so order is respected
            break
    if not shares_values:
        outqueue.append((k, v))
outdict = dict(outqueue)
assert outdict == {
    "A, B, C": ["tango", "bravo", "alpha", "bravo", "test", "romero", "test"],
    "D, E": ["other", "and", "other"],
    "F": ["loner"],
}
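Note that this keeps duplicates in the merged lists. If you want the set-valued output shown in the question, a dict comprehension over the result converts it (using the outdict from the assert above):

```python
outdict = {
    "A, B, C": ["tango", "bravo", "alpha", "bravo", "test", "romero", "test"],
    "D, E": ["other", "and", "other"],
    "F": ["loner"],
}

# deduplicate each merged value list by converting it to a set
merged = {k: set(v) for k, v in outdict.items()}
```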
There may be some kind of itertools or DataFrame.groupby.agg magic that could solve this problem, but given that the solution will always require an arbitrary number of passes over the item list, you may be better off using an explicit queue-processing approach like the one I've shown here.
Upvotes: 2