Reputation: 1293
I have two lists, and I have to extract the items from the first list whose first element is present in the second. The code I have pasted below works perfectly, but as I am operating with several million records it is painfully slow. Does anyone have any idea how it can be optimized?
a = [[1,0],[2,0],[3,0],[4,0]]
b = [2,4,7,8]
same_nums = list(set([x[0] for x in a]).intersection(set(b)))
result = []
for i in a:
    if i[0] in same_nums:
        result.append(i)
print(result)
Upvotes: 3
Views: 1008
Reputation: 1293
The fastest solutions (at least in my specific case) were the ones provided by schwobaseggl and Novin Shahroudi.
I am currently using schwobaseggl's suggestion, as I am reading the data from an SQL query and Novin Shahroudi's solution requires me to do more data type conversions.
Upvotes: 1
Reputation: 680
I recommend using the NumPy library, as it uses a C/C++ implementation under the hood and therefore runs much faster.
import numpy as np
a = np.random.randint(10000, size=(10000000, 2))
b = np.random.randint(10000, size=1000)
mask = np.isin(a[:,0], b)
a_masked = a[mask, :]
# Exec time: 3.2 s ± 185 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
The other approach (Python list) exec time for the same number of elements is as follows:
[x for x in a if x[0] in b]
# Exec time: 55.1 s ± 5.47 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
As you can see, the NumPy version runs considerably faster. You can also convert Python lists to NumPy arrays and vice versa easily.
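For instance, the conversion round trip can be sketched with the sample data from the question (the `tolist()` call at the end is what turns the NumPy result back into plain Python lists):

```python
import numpy as np

# Sample data from the question, converted from Python lists to NumPy arrays
a = np.array([[1, 0], [2, 0], [3, 0], [4, 0]])
b = np.array([2, 4, 7, 8])

# Boolean mask: True for rows whose first column appears in b
mask = np.isin(a[:, 0], b)

# Apply the mask, then convert back to a plain Python list of lists
result = a[mask].tolist()
print(result)  # [[2, 0], [4, 0]]
```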
Upvotes: 1
Reputation: 788
a = [[1,0],[2,0],[3,0],[4,0]]
b = [2,4,7,8]
def return_common_elements(a, b):
    for subval in a:
        if subval[0] in b:
            yield subval
print(list(return_common_elements(a, b)))
>>> [[2, 0], [4, 0]]
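Note that each `subval[0] in b` check scans the list `b` linearly. A hypothetical variant that builds a set from `b` once keeps the generator's low memory footprint while making each membership test O(1) on average:

```python
a = [[1, 0], [2, 0], [3, 0], [4, 0]]
b = [2, 4, 7, 8]

def common_elements(a, b):
    b_set = set(b)  # one-time O(len(b)) conversion for O(1) average lookups
    for subval in a:
        if subval[0] in b_set:
            yield subval

print(list(common_elements(a, b)))  # [[2, 0], [4, 0]]
```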
Upvotes: 0
Reputation: 73498
You are overcomplicating things. Just turn b into a set to speed up the contains check. Then one iteration of a in the comprehension will suffice:
set_b = set(b) # makes vvvvvvvvvvvvv O(1)
result = [x for x in a if x[0] in set_b]
In particular, turning same_nums back into a list is a real performance killer, as it makes the whole thing O(m*n) again. With a single set built from b it is O(m+n). But same_nums is entirely unnecessary to begin with, since you know all the i[0] are in a as you are iterating a.
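As a rough sanity check of the complexity argument (exact timings are machine-dependent and the sizes below are arbitrary), the gap between list and set membership can be measured with timeit:

```python
import random
import timeit

# Synthetic data: 20,000 pairs, and a lookup collection of 2,000 keys
a = [[random.randrange(10000), 0] for _ in range(20000)]
b = list(range(0, 10000, 5))
set_b = set(b)

# Same comprehension, list membership (O(m*n)) vs. set membership (O(m+n))
t_list = timeit.timeit(lambda: [x for x in a if x[0] in b], number=1)
t_set = timeit.timeit(lambda: [x for x in a if x[0] in set_b], number=1)
print(f"list membership: {t_list:.4f}s, set membership: {t_set:.4f}s")
```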
Upvotes: 4
Reputation: 1535
Try a list comprehension with a filter instead of creating a second list of the same size and going via sets, something like:
[x for x in a if x[0] in b]
It could be faster.
In the code you posted, you do a lot of things: you create a copy of a with the second dimension removed, make a set from it, intersect it with the other set, and copy the resulting set back to a list. The big list is processed at least four times this way. My proposal only goes through the list once, so I actually expect it to be faster.
Upvotes: 4