송준석

Reputation: 1031

Is it possible to extract an intersection list that contains duplicate values?

I want to get the intersection of two lists without duplicates being eliminated, and ideally without writing an explicit loop. Below was my attempt, but it failed because converting to sets removed the duplicates.

a = ['a','b','c','f']
b = ['a','b','b','o','k']

tmp = list(set(a) & set(b))
>>> tmp
['b', 'a']

I want the result to be ['a', 'b', 'b'].

In this setup, a is fixed and b varies: the idea is to extract from b the elements that also appear in a.

Is there a way to compute such an intersection list without removing duplicate values?

Upvotes: 1

Views: 320

Answers (5)

6502

Reputation: 114579

A solution could be

good = set(a)
result = [x for x in b if x in good]

There are two loops here: one is the set-building loop inside set (implemented in C, roughly a hundred times faster than anything you can write in Python), and the other is the comprehension, which runs in the interpreter. The set is built to avoid a linear search through a for each element of b (if a becomes big, that linear search is a serious problem).

Note that using filter instead is probably not going to gain much (if anything): even though the filter loop itself runs in C, it has to call back into the interpreter for each element to invoke the filtering function.
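One way around that callback cost, sketched here under the same assumptions (a `good` set built from a), is to pass the set's bound `__contains__` method to filter, so the per-element membership test also stays in C:

```python
a = ['a', 'b', 'c', 'f']
b = ['a', 'b', 'b', 'o', 'k']

good = set(a)
# set.__contains__ is implemented in C, so filter never has to
# re-enter the interpreter to run a Python-level predicate
result = list(filter(good.__contains__, b))
print(result)  # → ['a', 'b', 'b']
```

Whether this beats the comprehension in practice still depends on the data; it only removes the interpreter round-trip, not the loop itself.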

Note that if you care about speed, then Python is probably not a good choice in the first place... PyPy, for example, might do better here, and in that case simply writing the optimal algorithm explicitly should be fine (avoiding re-searching the set for duplicates when they are consecutive in b, as happens in your example):

good = set(a)
res = []
i = 0
while i < len(b):
    x = b[i]
    if x in good:
    while i < len(b) and b[i] == x:  # consume the run of consecutive duplicates
            res.append(x)
            i += 1
    else:
        i += 1

Of course, in performance optimization the only reliable approach is to try and measure with real data on the real system; guessing works less and less as technology advances and becomes more complicated.

Upvotes: 2

It isn't clear how duplicates should be handled when intersecting lists that contain duplicate elements: you gave only one test case with its expected result, and you did not explain the duplicate handling.

Going by your expected output, the common elements are 'a' and 'b', and the intersection list contains 'a' with multiplicity 1 and 'b' with multiplicity 2: 'a' occurs once in both a and b, while 'b' occurs twice in b. In other words, each common element appears in the intersection list with the maximum multiplicity it has across the lists.

The answer is yes. However, a loop may still be executed implicitly, even if your code contains no explicit loop statements; this algorithm is inherently iterative.

Step 1: Create the intersection set, Intersect, which contains no duplicates (you have already done that). Convert it to a list so it can be indexed.

Step 2: Create a second list, IntersectD. For each common element, compute Freq, the maximum number of occurrences of that element across the lists, using count. Then append each element Intersect[k] to IntersectD a number of times given by its corresponding Freq[k].

Example code with 3 lists:

a = ['a','b','c','1','1','1','1','2','3','o']
b = ['a','b','b','o','1','o','1']
c = ['a','a','a','b','1','2']

intersect = list(set(a) & set(b) & set(c)) # 3-set case
intersectD = []

for k in range(len(intersect)):
  cmn = intersect[k]
  freq = max(a.count(cmn), b.count(cmn), c.count(cmn)) # 3-set case
  for i in range(freq): # Can be done with itertools
    intersectD.append(cmn)

>>> intersectD
['b', 'b', 'a', 'a', 'a', '1', '1', '1', '1']

For more than two lists, freq for each common element can be computed with a longer set-intersection and max expression, or, if the lists are kept in a list of lists, with an inner loop. You can also replace the inner i-loop with an itertools expression from How can I count the occurrences of a list item?.
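As a sketch of that itertools replacement (same example lists as above; element order in the result depends on set iteration order), the inner appending loop can be expressed with repeat and chain.from_iterable:

```python
import itertools

a = ['a','b','c','1','1','1','1','2','3','o']
b = ['a','b','b','o','1','o','1']
c = ['a','a','a','b','1','2']

intersect = set(a) & set(b) & set(c)
# repeat(cmn, freq) replaces the inner appending loop;
# each common element is repeated by its maximum count across the lists
intersectD = list(itertools.chain.from_iterable(
    itertools.repeat(cmn, max(a.count(cmn), b.count(cmn), c.count(cmn)))
    for cmn in intersect
))
```

This avoids the explicit inner loop, though each element is still counted with three O(n) count calls.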

Upvotes: 1

iGian
iGian

Reputation: 11193

I doubt this is faster than a loop, and you probably still need a loop at the end to extract the result. Anyway...

from collections import Counter

a = ['a','a','b','c','f']
b = ['a','b','b','o','k']

count_b = Counter(b)
count_ab = Counter(set(b)-set(a))
count_b - count_ab

#=> Counter({'a': 1, 'b': 2})


That is, if res holds the resulting Counter, you still need:

[ val for sublist in [ [s] * n for s, n in res.items() ] for val in sublist ]
#=> ['a', 'b', 'b']
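For what it's worth, Counter can also expand the result directly via its elements() method, which would avoid the nested comprehension. A sketch using the same counters as above (order of elements() follows the Counter, so it is sorted here for a stable result):

```python
from collections import Counter

a = ['a', 'a', 'b', 'c', 'f']
b = ['a', 'b', 'b', 'o', 'k']

count_b = Counter(b)
count_ab = Counter(set(b) - set(a))
res = count_b - count_ab
# Counter.elements() yields each key repeated by its count
print(sorted(res.elements()))  # → ['a', 'b', 'b']
```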

Upvotes: 1

Justice_Lords
Justice_Lords

Reputation: 969

>>> a = ['a','b','c','f']
>>> b = ['a','b','b','o','k']
>>> items = set(a)
>>> found = [i for i in b if i in items]
>>> items
{'f', 'a', 'c', 'b'}
>>> found
['a', 'b', 'b']

This should do what you want.

Upvotes: 1

EliadL
EliadL

Reputation: 7088

If you insist on not using for explicitly, then this will work:

>>> list(filter(a.__contains__, b))
['a', 'b', 'b']

But directly calling magic methods like __contains__ is, to the best of my knowledge, not recommended practice, so consider this instead:

>>> list(filter(lambda x: x in a, b))
['a', 'b', 'b']

And if you want to improve the lookup in a from O(n) to O(1) then create a set of it first:

>>> a_set = set(a)
>>> list(filter(lambda x: x in a_set, b))
['a', 'b', 'b']

Upvotes: 1
