Reputation: 89
I'm working on async code that makes thousands of requests. Each request is saved in a tuple with its id and response and then appended to the tasks list.
I normally end up with a list of 4000+ tuples.
After running the code I get a list like this:
responses = [(00001, {"code": 0, "foo": "bar"}), (00002, {"code": 0, "foo": "bar"}), (00003, {"code": 0, "foo": "bar"}), (00004, None), (00005, None), (00006, {"code": 0, "foo": "bar"})]
As I only need the ones with a JSON response, I want to delete all tuples where the second element is None.
I've iterated over the list, appending only the "valid" tuples (the ones without a None value) to a new list, but it doesn't perform well; a simplified sketch of that loop is below.
Is there a way I can delete these tuples with None without having to iterate over them one by one?
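This is roughly what that loop looks like (simplified; the real code is async and the names are illustrative):
valid_responses = []
for item in responses:           # responses is the list of (id, response) tuples
    if item[1] is not None:      # keep only entries that actually got a JSON body
        valid_responses.append(item)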
Upvotes: 1
Views: 741
Reputation: 46
TL;DR: List comprehension performs best, then the builtin filter, and multiprocessing.Pool is overkill, i.e. the worst.
I tested all of them on my machine (Python 3.10.2); output:
$ python main.py
288.01 mks in filter_LC([(1, {'code': 0, 'foo'...) # List comprehension
469.21 mks in filter_builtin([(1, {'code': 0, 'foo'...) # Builtin filter
15.28 ms in filter_pool([(1, {'code': 0, 'foo'...) # Multiprocessing
Test code:
from multiprocessing import Pool  # use processes
# from multiprocessing.dummy import Pool  # the thread-based Pool performs better than the process-based one, but only slightly
from funcy import print_durations

responses = [(1, {"code": 0, "foo": "bar"}),
             (2, None),
             (3, {"code": 0, "foo": "bar"}),
             (4, None),
             (5, None),
             (6, {"code": 0, "foo": "bar"})] * 1000

@print_durations
def filter_LC(responses):
    return [c for c in responses if c[1] != None]

@print_durations
def filter_builtin(responses):
    return list(filter(lambda c: c[1] != None, responses))

# Helpers for filter_pool()
def valid(x):
    if len(x) < 2 or x[1] == None:
        return False
    return True

def pool_filter(pool, func, candidates):
    return [c for c, keep in zip(candidates, pool.map(func, candidates)) if keep]

@print_durations
def filter_pool(responses, pool_size=5):
    with Pool(pool_size) as p:
        return pool_filter(p, valid, responses)

if __name__ == "__main__":
    ans = [
        filter_LC(responses),
        filter_builtin(responses),
        filter_pool(responses),
    ]
    for a in ans:
        assert a == ans[0]
List comprehension beats the builtin filter. I guess filter suffers from the overhead of the lambda call, which the list comprehension doesn't have. And a thread/process pool is overkill here; better to save it for more time-consuming jobs than filtering ;)
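If you want to see how much the lambda itself costs, a rough variant (a sketch, not part of the timings above) swaps it for operator.itemgetter, which is implemented in C. Note the caveat in the comment: it keeps any truthy second element, so it would also drop empty dicts, not only None.
from operator import itemgetter
from funcy import print_durations

@print_durations
def filter_itemgetter(responses):
    # itemgetter(1) extracts the response; filter() keeps the truthy ones.
    # Caveat: this also drops empty dicts, not only None values.
    return list(filter(itemgetter(1), responses))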
Reference: the pool_filter() snippet comes from How to use parallel processing filter in Python? - Stack Overflow
Upvotes: 3
Reputation: 2370
The idea behind the filter function is great and very Pythonic. Its only weakness is performance, and it doesn't work with multiprocessing out of the box, because lambdas are not picklable by default.
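As a minimal illustrative sketch: a module-level function can be passed to Pool.map, while a lambda fails to pickle.
from multiprocessing import Pool

def has_response(pair):
    # Module-level functions can be pickled and shipped to worker processes.
    return pair[1] is not None

if __name__ == "__main__":
    data = [(1, {"code": 0, "foo": "bar"}), (2, None)]
    with Pool(2) as p:
        print(p.map(has_response, data))   # [True, False]
        # p.map(lambda x: x[1] is not None, data)  # raises PicklingError: lambdas can't be pickled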
map and reduce are other alternatives; see the reference here. An example solution:
def validate_request(request):
    # Keep only requests whose response (second element) is not None
    return request[1] is not None

# assuming `pool` is an existing multiprocessing.Pool instance
requests = [r for r, valid in zip(requests, pool.map(validate_request, requests)) if valid]
Upvotes: 1
Reputation: 1171
You might try using Python's filter(). For example, you could do this:
valids = filter(lambda x: x[1] is not None, responses)
where for your sample responses variable, valids would be
[(1, {'code': 0, 'foo': 'bar'}),
(2, {'code': 0, 'foo': 'bar'}),
(3, {'code': 0, 'foo': 'bar'}),
(6, {'code': 0, 'foo': 'bar'})]
BTW, in Python 3, leading zeros on decimal integer literals are not allowed, so this code is for Python 2.x. (In Python 3, filter() also returns a lazy iterator rather than a list, so you would call list(valids) to get the list shown above.)
Now whether this would be more performant than a list comprehension, I can't say for sure, though one blog post indicates that it may not be. Under the hood, Python is probably still iterating over the tuples one by one, but at least that's not reflected in the semantics of the code.
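If you want to measure it on your own data, a rough timeit sketch along these lines would do (the sample data and any numbers are purely illustrative):
import timeit

# Illustrative sample: every third response is None
responses = [(i, None if i % 3 == 0 else {"code": 0, "foo": "bar"}) for i in range(6000)]

lc = timeit.timeit(lambda: [r for r in responses if r[1] is not None], number=1000)
fl = timeit.timeit(lambda: list(filter(lambda r: r[1] is not None, responses)), number=1000)
print(f"list comprehension: {lc:.3f}s  filter: {fl:.3f}s")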
Upvotes: 1