user625477
user625477

Reputation: 3276

find the "overlap" between 2 python lists

Given 2 lists:

a = [3,4,5,5,5,6]
b = [1,3,4,4,5,5,6,7]

I want to find the "overlap":

c = [3,4,5,5,6]

I'd also like it if i could extract the "remainder" the part of a and b that's not in c.

a_remainder = [5,]
b_remainder = [1,4,7,]

Note: a has three 5's in it and b has two. b has two 4's in it and a has one.

The resultant list c should have two 5's (limited by list b) and one 4 (limited by list a).

This gives me what i want, but I can't help but think there's a much better way.

import copy

a = [3,4,5,5,5,6]
b = [1,3,4,4,5,5,6,7]

c = []
for elem in copy.deepcopy(a):
    if elem in b:
        a.pop(a.index(elem))
        c.append(b.pop(b.index(elem)))

# now a and b both contain the "remainders" and c contains the "overlap"

On another note, what is a more accurate name for what I'm asking for than "overlap" and "remainder"?

Upvotes: 22

Views: 32502

Answers (8)

Rosh Oxymoron
Rosh Oxymoron

Reputation: 21055

collection.Counter available in Python 2.7 can be used to implement multisets that do exactly what you want.

a = [3,4,5,5,5,6]
b = [1,3,4,4,5,5,6,7]

a_multiset = collections.Counter(a)
b_multiset = collections.Counter(b)

overlap = list((a_multiset & b_multiset).elements())
a_remainder = list((a_multiset - b_multiset).elements())
b_remainder = list((b_multiset - a_multiset).elements())

print overlap, a_remainder, b_remainder

Upvotes: 24

Steven
Steven

Reputation: 28666

Try difflib.SequenceMatcher(), "a flexible class for comparing pairs of sequences of any type"...

A quick try:

a = [3,4,5,5,5,6]
b = [1,3,4,4,5,5,6,7]

sm = difflib.SequenceMatcher(None, a, b)
c = []
a_remainder = []
b_remainder = []

for tag, i1, i2, j1, j2 in sm.get_opcodes():
    if tag == 'replace':
        a_remainder.extend(a[i1:i2])
        b_remainder.extend(b[j1:j2])
    elif tag == 'delete':
        a_remainder.extend(a[i1:i2])
    elif tag == 'insert':
        b_remainder.extend(b[j1:j2])
    elif tag == 'equal':
        c.extend(a[i1:i2])

And now...

>>> print c
[3, 4, 5, 5, 6]
>>> print a_remainder
[5]
>>> print b_remainder
[1, 4, 7]

Upvotes: 1

Andrew Jaffe
Andrew Jaffe

Reputation: 27087

OK, verbose, but kind of cool (similar in spirit to the collections.Counter idea, but more home-made):

import itertools as it
flatten = it.chain.from_iterable 
sorted(
   v for u,v in 
    set(flatten(enumerate(g) 
        for k, g in it.groupby(a))).intersection(
    set(flatten(enumerate(g)
        for k, g in it.groupby(b))))
   )

The basic idea is to make each of the lists into a new list which attaches a counter to each object, numbered to account for duplicates -- so that then you can then use set operations on these tuples after all.

To be slightly less verbose:

 aa = set(flatten(enumerate(g) for k, g in it.groupby(a)))
 bb = set(flatten(enumerate(g) for k, g in it.groupby(b)))
 # aa = set([(0, 3), (0, 4), (0, 5), (0, 6), (1, 5), (2, 5)])
 # bb = set([(0, 1), (0, 3), (0, 4), (0, 5), (0, 6), (0, 7), (1, 4), (1, 5)])

 cc = aa.intersection(bb)
 # cc = set([(0, 3), (0, 4), (0, 5), (0, 6), (1, 5)])
 c = sorted(v for u,v in cc)
 # c = [3, 4, 5, 5, 6]
  • groupby -- produces a list of lists containing identical elements (but because of the syntax needs the g for k,g in it.groupby(a) to extract each list)
  • enumerate -- appends a counter to each element of each sublist
  • flatten -- create a single list
  • set -- convert to a set
  • intersection -- find the common elements
  • sorted(v for u,v in cc) -- get rid of the counters and sort the result

Finally, I'm not sure what you mean by the remainders; it seems like it ought to be my aa-cc and bb-cc but I don't know where you get a_remainder = [4]:

sorted(v for u,v in aa-cc)
# [5]

sorted(v for u,v in bb-cc)
# [1, 4, 7]

Upvotes: 4

user625477
user625477

Reputation: 3276

A response from kerio in #python on freenode:

[ i for i in itertools.chain.from_iterable([k] * v for k, v in \
  (Counter(a) & Counter(b)).iteritems())
]

Upvotes: 2

Andrew Clark
Andrew Clark

Reputation: 208475

I don't think you should actually use this solution, but I took this opportunity to practice with lambda functions and here is what I came up with :)

a = [3,4,5,5,5,6]
b = [1,3,4,4,5,5,6,7]
dedup = lambda x: [set(x)] if len(set(x)) == len(x) else [set(x)] + dedup([x[i] for i in range(1, len(x)) if x[i] == x[i-1]])
default_set = lambda x: (set() if x[0] is None else x[0], set() if x[1] is None else x[1])
deduped = map(default_set, map(None, dedup(a), dedup(b)))
get_result = lambda f: reduce(lambda x, y: list(x) + list(y), map(lambda x: f(x[0], x[1]), deduped))
c = get_result(lambda x, y: x.intersection(y))          # [3, 4, 5, 6, 5]
a_remainder = get_result(lambda x, y: x.difference(y))  # [5]
b_remainder = get_result(lambda x, y: y.difference(x))  # [1, 7, 4]

I'm pretty sure izip_longest would have simplified this a bit (wouldn't have needed the default_set lambda), but I was testing this with Python 2.5.

Here are some of the intermediate values used in the calculation in case anyone wants to understand this:

dedup(a) = [set([3, 4, 5, 6]), set([5]), set([5])]
dedup(b) = [set([1, 3, 4, 5, 6, 7]), set([4, 5])]
deduped = [(set([3, 4, 5, 6]), set([1, 3, 4, 5, 6, 7])), (set([5]), set([4, 5])), (set([5]), set([]))]

Upvotes: -1

dfb
dfb

Reputation: 13289

In the language of sets, overlap is 'intersection' and remainder is 'set difference'. If you had distinct items, you wouldn't have to do these operations yourself, check out http://docs.python.org/library/sets.html if you're interested.

Since we're not working with distinct elements, your approach is reasonable. If you wanted this to run faster, you could create a dictionary for each list and map the number to how many elements are in each array (e.g., in a, 3->1, 4->1, 5->2, etc.). You would then iterate through map a, determine if that letter existed, decrement its count and add it to the new list

Untested code, but this is the idea

def add_or_update(map,value):
    if value in map:
        map[value]+=1
    else
        map[value]=1

b_dict = dict()
for b_elem in b:
    add_or_update(b_dict,b_elem)

intersect = []; diff = [];

for a_elem in a:
    if a_elem in b_dict and b_dict[a_elem]>0:
        intersect.add(a_elem);

for k,v in diff:
    for i in range(v):
        diff.add(k);

Upvotes: 6

kefeizhou
kefeizhou

Reputation: 6552

Use python set

intersection = set(a) & set(b)
a_remainder = set(a) - set(b)
b_remainder = set(b) - set(a)

Upvotes: 20

kirilloid
kirilloid

Reputation: 14304

Aset = Set(a);
Bset = Set(b);
a_remainder = a.difference(b);
b_remainder = b.difference(a);
c = a.intersection(b);

But if you need c to have duplicates, and order is important for you,
you may look for w:Longest common subsequence problem

Upvotes: 0

Related Questions