Finding symmetric difference of 2 sets separating them by origin

I have two sets:

>>> a = {1,2,3}
>>> b = {2,3,4,5,6}

And I would like to get two new sets with non common elements, first set containing elements from a and second from b, like ({1}, {4,5,6}), or like:

>>> c = a&b # Common elements
>>> d = a^b # Symmetric difference
>>> (a-b, b-a)
({1}, {4, 5, 6})
>>> (a-c, b-c)
({1}, {4, 5, 6})
>>> (a&d, b&d)
({1}, {4, 5, 6})

My problem is that I'm going to use this on large number of sha1 hashes and I'm worried about performance. What is proper way of doing this efficiently?

Note: a and b are going to have around 95% of elements common, 1% will be in a and 4% in b.

Upvotes: 1

Answers (1)

Vyktor

Reputation: 21007

Methods I've mentioned in the question has following performance:

>>> timeit.timeit('a-b; b-a', 'a=set(range(0,1500000)); b=set(range(1000000, 2000000))', number=1000)
135.45828826893307
>>> timeit.timeit('c=a&b; a-c; b-c', 'a=set(range(0,1500000)); b=set(range(1000000, 2000000))', number=1000)
189.98522938665735
>>> timeit.timeit('d=a^b; a&d; b&d', 'a=set(range(0,1500000)); b=set(range(1000000, 2000000))', number=1000)
238.35084129583106

So most effective way seems to be using (a-b, b-a) method.

I'm posting this as a reference so other answers would add new methods, not compare the ones I've found.

Python implemented function

Just out of curiosity I've tried implementing own python function to do this (that works on pre-sorted iterators):

def symmetric_diff(l1,l2):
    # l1 and l2 has to be sorted and contain comparable elements
    r1 = []
    r2 = []
    i1 = iter(l1)
    i2 = iter(l2)

    try:
        e1 = next(i1)
    except StopIteration: return ([], list(i2))
    try:
        e2 = next(i2)
    except StopIteration: return ([e1] + list(i1), [])

    try:
        while True: 
            if e1 == e2:
                e1 = next(i1)
                e2 = next(i2)
            elif e1 > e2:
                r2.append(e2)
                e2 = next(i2)
            else:
                r1.append(e1)
                e1 = next(i1)

    except StopIteration:
        if e1==e2:
            return (r1+list(i1), r2+list(i2))
        elif e1 > e2:
            return (r1+[e1]+list(i1), r2+list(i2))
        else:   
            return (r1+list(i1), r2+[e2]+list(i2))

Compared to other methods, this one has quite low performance:

t = timeit.Timer(lambda: symmetric_diff(a,b))
>>> t.timeit(1000)
542.3225249653769

So unless some other method is implemented somewhere (some library for working with sets) I think using two sets difference is the most efficient way of doing this.

Upvotes: 2

Finding symmetric difference of 2 sets separating them by origin

Answers (1)

Python implemented function

Related Questions