Reputation: 6699
I have a collection of lists, some of which have overlapping elements:
coll = [['aaaa', 'aaab', 'abaa'],
['bbbb', 'bbbb'],
['aaaa', 'bbbb'],
['dddd', 'dddd'],
['bbbb', 'bbbb', 'cccc','aaaa'],
['eeee','eeef','gggg','gggi'],
['gggg','hhhh','iiii']]
I want to pool together only the lists that overlap, which would yield
pooled = [['aaaa', 'aaab', 'abaa','bbbb','cccc'],
['eeee','eeef','gggg','gggi','hhhh','iiii'],
['dddd', 'dddd']]
(In case it's not clear, the first and second list both overlap with the third and should therefore all be merged together, even though they don't themselves contain elements in common.)
"Overlap" means two lists have at least one element in common. "Merge" means joining the two lists into either a single flat list or a single flat set.
There may be several sets, e.g. x, y, and z overlap with each other, v and w overlap with each other, but x+y+z don't overlap with v+w. And there may be lists that don't overlap with anything.
(An analogy is families. Join all the Montagues together, join all the Capulets together, but no Montague has ever married into the Capulets so those two clusters will remain distinct.)
I don't care whether duplicate items are included multiple times or not.
What's a simple and reasonably fast way of doing this in Python?
Edit: This does not seem to be a duplicate of Yet another merging list of lists, but most pythonic way because that doesn't seem to consider groups that only overlap through a third set. The solutions I've tried from that question don't yield the answer I'm looking for here.
Upvotes: 0
Views: 2202
Reputation: 11
I'm not sure how this code goes against others in regard of speed, but its quite simple.
for i in range(len(coll)-1,-1,-1):
for j in range(i-1,-1,-1):
if set(coll[j]) & set(coll[i]):
coll[j].extend(coll.pop(i))
break
output:
[['aaaa', 'aaab', 'abaa', 'bbbb', 'bbbb', 'aaaa', 'bbbb', 'bbbb', 'bbbb', 'cccc', 'aaaa'],
['dddd', 'dddd'],
['eeee', 'eeef', 'gggg', 'gggi', 'gggg', 'hhhh', 'iiii']]
what is good, it's in place and outputs duplicated values in every sublist - both of those things can be easly changed:
Upvotes: 0
Reputation: 42143
You can do this with sets using a successive merging approach:
coll = [['aaaa', 'aaab', 'abaa'],
['bbbb', 'bbbb'],
['aaaa', 'bbbb'],
['dddd', 'dddd'],
['bbbb', 'bbbb', 'cccc','aaaa'],
['eeee','eeef','gggg','gggi'],
['gggg','hhhh','iiii']]
pooled = [set(subList) for subList in coll]
merging = True
while merging:
merging=False
for i,group in enumerate(pooled):
merged = next((g for g in pooled[i+1:] if g.intersection(group)),None)
if not merged: continue
group.update(merged)
pooled.remove(merged)
merging = True
print(pooled)
# [{'aaaa', 'abaa', 'aaab', 'cccc', 'bbbb'}, {'dddd'}, {'gggg', 'eeef', 'eeee', 'hhhh', 'gggi', 'iiii'}]
Upvotes: 1
Reputation: 6699
Working on a suggestion from alkasm in the comments, I used networkx:
import networkx as nx
coll = [['aaaa', 'aaab', 'abaa'],
['bbbb', 'bbbb'],
['aaaa', 'bbbb'],
['dddd', 'dddd'],
['bbbb', 'bbbb', 'cccc','aaaa'],
['eeee','eeef','gggg','gggi'],
['gggg','hhhh','iiii']]
edges = []
for i in range(len(coll)):
a = coll[i]
for j in range(len(coll)):
if i != j:
b = coll[j]
if set(a).intersection(set(b)):
edges.append((i,j))
G = nx.Graph()
G.add_nodes_from(range(len(coll)))
G.add_edges_from(edges)
for c in nx.connected_components(G):
combined_lists = [coll[i] for i in c]
flat_list = [item for sublist in combined_lists for item in sublist]
print(set(flat_list))
Output:
{'cccc', 'bbbb', 'aaab', 'aaaa', 'abaa'}
{'dddd'}
{'eeef', 'eeee', 'hhhh', 'gggg', 'gggi', 'iiii'}
Undoubtedly this can be optimized but it seems to solve my problem for now.
Upvotes: 0
Reputation: 2359
Here's a way to do it (assuming you want unique elements over the overlapping results):
def over(coll):
print('Input is:\n', coll)
# gather the lists that do overlap
overlapping = [x for x in coll if any(x_element in [y for k in coll if k != x for y in k] for x_element in x)]
# flatten and get unique
overlapping = sorted(list(set([z for x in overlapping for z in x])))
# get the rest
non_overlapping = [x for x in coll if all(y not in overlapping for y in x)]
# use the line bellow only if merged non-overlapping elements are desired
# non_overlapping = sorted([y for x in non_overlapping for y in x])
print('Output is"\n',[overlapping, non_overlapping])
coll = [['aaaa', 'aaab', 'abaa'],
['bbbb', 'bbbb'],
['aaaa', 'bbbb'],
['dddd', 'dddd'],
['bbbb', 'bbbb', 'cccc','aaaa']]
over(coll)
coll = [['aaaa', 'aaaa'], ['bbbb', 'bbbb']]
over(coll)
output:
$ python3 over.py -- NORMAL --
Input is:
[['aaaa', 'aaab', 'abaa'], ['bbbb', 'bbbb'], ['aaaa', 'bbbb'], ['dddd', 'dddd'], ['bbbb', 'bbbb', 'cccc', 'aaaa']]
Output is"
[['aaaa', 'aaab', 'abaa', 'bbbb', 'cccc'], [['dddd', 'dddd']]]
Input is:
[['aaaa', 'aaaa'], ['bbbb', 'bbbb']]
Output is"
[[], [['aaaa', 'aaaa'], ['bbbb', 'bbbb']]]
Upvotes: 1