Reputation: 21007
I have two sets:
>>> a = {1,2,3}
>>> b = {2,3,4,5,6}
And I would like to get two new sets with non common elements, first set containing elements from a
and second from b
, like ({1}, {4,5,6})
, or like:
>>> c = a&b # Common elements
>>> d = a^b # Symmetric difference
>>> (a-b, b-a)
({1}, {4, 5, 6})
>>> (a-c, b-c)
({1}, {4, 5, 6})
>>> (a&d, b&d)
({1}, {4, 5, 6})
My problem is that I'm going to use this on large number of sha1 hashes and I'm worried about performance. What is proper way of doing this efficiently?
Note: a
and b
are going to have around 95% of elements common, 1% will be in a
and 4% in b
.
Upvotes: 1
Views: 897
Reputation: 21007
Methods I've mentioned in the question has following performance:
>>> timeit.timeit('a-b; b-a', 'a=set(range(0,1500000)); b=set(range(1000000, 2000000))', number=1000)
135.45828826893307
>>> timeit.timeit('c=a&b; a-c; b-c', 'a=set(range(0,1500000)); b=set(range(1000000, 2000000))', number=1000)
189.98522938665735
>>> timeit.timeit('d=a^b; a&d; b&d', 'a=set(range(0,1500000)); b=set(range(1000000, 2000000))', number=1000)
238.35084129583106
So most effective way seems to be using (a-b, b-a)
method.
I'm posting this as a reference so other answers would add new methods, not compare the ones I've found.
Just out of curiosity I've tried implementing own python function to do this (that works on pre-sorted iterators):
def symmetric_diff(l1,l2):
# l1 and l2 has to be sorted and contain comparable elements
r1 = []
r2 = []
i1 = iter(l1)
i2 = iter(l2)
try:
e1 = next(i1)
except StopIteration: return ([], list(i2))
try:
e2 = next(i2)
except StopIteration: return ([e1] + list(i1), [])
try:
while True:
if e1 == e2:
e1 = next(i1)
e2 = next(i2)
elif e1 > e2:
r2.append(e2)
e2 = next(i2)
else:
r1.append(e1)
e1 = next(i1)
except StopIteration:
if e1==e2:
return (r1+list(i1), r2+list(i2))
elif e1 > e2:
return (r1+[e1]+list(i1), r2+list(i2))
else:
return (r1+list(i1), r2+[e2]+list(i2))
Compared to other methods, this one has quite low performance:
t = timeit.Timer(lambda: symmetric_diff(a,b))
>>> t.timeit(1000)
542.3225249653769
So unless some other method is implemented somewhere (some library for working with sets) I think using two sets difference is the most efficient way of doing this.
Upvotes: 2