Finrod Felagund
Finrod Felagund

Reputation: 1293

find common elements from sublist in list

I have two list, and I have to extract the items from the first list, from which the first element is present in the second. The code I have pasted bellow works perfectly but as I am operating with several million records, it is painfully slow. Does any one have any idea how it can be optimized?

a = [[1,0],[2,0],[3,0],[4,0]]
b = [2,4,7,8]

same_nums = list(set([x[0] for x in a]).intersection(set(b)))

result = []

for i in a:
    if i[0] in same_nums:
        result.append(i)

print(result)

Upvotes: 3

Views: 1008

Answers (5)

Finrod Felagund
Finrod Felagund

Reputation: 1293

The most optimal solutions (at least my specific case) were the ones provided by schwobaseggl and Novin Shahroudi

I am currently using schwobaseggl's suggestion as I am reading the data from SQL query and Novin Shahroudi solutions requires me to do more data type conversions.

Upvotes: 1

Novin Shahroudi
Novin Shahroudi

Reputation: 680

I recommend using Numpy library as it uses C/C++ implementation under the hood so that it runs much faster.

import numpy as np

a = np.random.randint(10000, size=(10000000, 2))
b = np.random.randint(10000, size=1000)

mask = np.isin(a[:,0], b)
a_masked = a[mask, :]

# Exec time: 3.2 s ± 185 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

The other approach (Python list) exec time for the same number of elements is as follows:

[x for x in a if x[0] in b]

# Exec time: 55.1 s ± 5.47 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

As you can see the numpy runs considerably faster. You can also turn python lists to numpy and vice-versa simply.

Upvotes: 1

Veera Balla Deva
Veera Balla Deva

Reputation: 788

a = [[1,0],[2,0],[3,0],[4,0]]
b = [2,4,7,8]
def return_common_elemens(a,b):
        for subval in a:
            if subval[0] in b:
                yield subval

print(list(return_common_elemens(a,b)))
>>>[[2, 0], [4, 0]]

Upvotes: 0

user2390182
user2390182

Reputation: 73498

You are overcomplicating things. Just turn b into a set to speed up the contains check. Then one iteration of a in the comprehension will suffice:

set_b = set(b)  # makes   vvvvvvvvvvvvv  O(1)
result = [x for x in a if x[0] in set_b]

Particular turning same_nums back into a list is a real performance killer as it makes the whole thing O(m*n) again. With a single set from b it is O(m+n). But same_nums is entirely unnecessary to begin with, since you know all the i[0] are in a as you are iterating a.

Upvotes: 4

Demosthenes
Demosthenes

Reputation: 1535

Try list comprehension with the filter instead of creating a second list with the same size and going via sets, something like:

[x for x in a if x[0] in b]

Could be faster.

In the code you posted, you do a lot of things: You create a copy of a removing the second dimension, make a set from it, intersect it with the other set, and copy the set back to a list. The big list is at least processed four times this way. My proposal only goes through the list once, so I actually expect it to be faster.

Upvotes: 4

Related Questions