Reputation: 11355
I am screenshotting several thousand web pages with pyppeteer. I discovered by accident, that running the same script in 2 open terminals doubles the output I get. I tested this by opening up to 6 terminals and running the script and I was able to get up to 6 times the performance.
I am considering using loop.run_in_executor
to run the script in multiple processes or threads from a main program.
Is this the right call, or am I hitting some IO/CPU limit in my script?
Here is how I'm thinking of doing it. I don't know if this is the right thing to do.
import asyncio
import concurrent.futures

async def blocking_io():
    # File operations (such as logging) can block the
    # event loop: run them in a thread pool.
    with open('/dev/urandom', 'rb') as f:
        return f.read(100)

async def cpu_bound():
    # CPU-bound operations will block the event loop:
    # in general it is preferable to run them in a
    # process pool.
    return sum(i * i for i in range(10 ** 7))

def wrap_blocking_io():
    return asyncio.run(blocking_io())

def wrap_cpu_bound():
    return asyncio.run(cpu_bound())

async def main():
    loop = asyncio.get_running_loop()

    # Options:

    # 1. Run in the default loop's executor:
    result = await loop.run_in_executor(
        None, wrap_blocking_io)
    print('default thread pool', result)

    # 2. Run in a custom thread pool:
    with concurrent.futures.ThreadPoolExecutor(max_workers=6) as pool:
        result = await loop.run_in_executor(
            pool, wrap_blocking_io)
        print('custom thread pool', result)

    # 3. Run in a custom process pool:
    with concurrent.futures.ProcessPoolExecutor(max_workers=6) as pool:
        result = await loop.run_in_executor(
            pool, wrap_cpu_bound)
        print('custom process pool', result)

asyncio.run(main())
Upvotes: 2
Views: 1931
Reputation: 39546
I tested this by opening up to 6 terminals and running the script and I was able to get up to 6 times the performance.
Since pyppeteer
is already asynchronous, I presume you just don't run multiple browsers in parallel, and that's why your output increases when you run multiple processes.
To run several coroutines concurrently ("in parallel") you usually use something like asyncio.gather. Does your code use it? If not, check this example; this is how you should run multiple jobs:
responses = await asyncio.gather(*tasks)
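For illustration, here is a minimal, self-contained sketch of that gather pattern. The actual pyppeteer calls (browser.newPage(), page.goto(), page.screenshot()) are replaced by a hypothetical placeholder coroutine using asyncio.sleep, and a semaphore caps concurrency at 6, mirroring the six-terminal experiment:

```python
import asyncio

async def screenshot(url, sem):
    # Placeholder for the real pyppeteer work:
    # await page.goto(url); await page.screenshot(...)
    async with sem:
        await asyncio.sleep(0.1)  # simulate I/O-bound rendering
        return f'{url}.png'

async def main(urls):
    # Allow at most 6 screenshots in flight at once.
    sem = asyncio.Semaphore(6)
    tasks = [screenshot(url, sem) for url in urls]
    # gather runs all coroutines concurrently on one event loop
    # and returns their results in the original order.
    return await asyncio.gather(*tasks)

urls = [f'https://example.com/page{i}' for i in range(12)]
results = asyncio.run(main(urls))
print(results[0])  # https://example.com/page0.png
```

With this structure a single process overlaps all the page loads, so you should see roughly the same speedup you got from opening multiple terminals.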
If you are already using asyncio.gather,
consider providing a Minimal, Reproducible Example to make it easier to understand what happens.
Upvotes: 1